Recover peers using history from Lucene #44853

Conversation

DaveCTurner
Contributor

Thanks to peer recovery retention leases we now retain the history needed to
perform peer recoveries from the index instead of from the translog. This
commit adjusts the peer recovery process to do so, and also adjusts it to use
the existence of a retention lease to decide whether or not to attempt an
operations-based recovery.

Reverts #38904 and #42211
Relates #41536
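
For orientation, here is a minimal sketch of the decision described above, with the inputs boiled down to booleans. The helper and its parameter names are hypothetical illustrations, not the actual RecoverySourceHandler code.

// Hypothetical sketch: with retention leases in use, the existence of a peer recovery
// retention lease for the target (rather than translog state) gates whether an
// operations-based recovery is attempted at all.
static boolean attemptOperationsBasedRecovery(boolean useRetentionLeases,
                                              boolean targetHasRetentionLease,
                                              boolean startingSeqNoAssigned,
                                              boolean targetSharesHistory,
                                              boolean historyCompleteSinceStartingSeqNo) {
    if (useRetentionLeases && targetHasRetentionLease == false) {
        return false; // no lease retaining history for this peer: fall back to a file-based recovery
    }
    return startingSeqNoAssigned && targetSharesHistory && historyCompleteSinceStartingSeqNo;
}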

@DaveCTurner DaveCTurner added the >enhancement and :Distributed Indexing/Recovery labels Jul 25, 2019
@DaveCTurner DaveCTurner requested review from ywelsch and dnhatn July 25, 2019 12:10
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@DaveCTurner DaveCTurner left a comment

Marked as WIP as there are a few small points still to discuss (highlighted below).

assertThat(shard.refreshStats().getTotal(), equalTo(2L)); // refresh on: finalize and end of recovery
// refresh on: finalize and end of recovery
// finalizing a replica involves two refreshes with soft deletes because of estimateNumberOfHistoryOperations()
final long initialRefreshes = shard.routingEntry().primary() || shard.indexSettings().isSoftDeleteEnabled() == false ? 2L : 3L;
Contributor Author

I'm not sure if this is expected or not. Seems a bit awkward. Maybe there's a better solution?

Member

I think it's okay to initialize initialRefreshes from shard.refreshStats().getTotal() and remove this assertion.
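
A rough sketch of this suggestion, reusing only the accessors quoted in the snippet above; the surrounding test structure is assumed rather than shown.

// Sketch only: record the running refresh total before the recovery under test instead
// of hard-coding 2L or 3L, and assert on the delta afterwards.
final long refreshesBeforeRecovery = shard.refreshStats().getTotal();
// ... run the peer recovery under test ...
final long refreshesDuringRecovery = shard.refreshStats().getTotal() - refreshesBeforeRecovery;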

long externalRefreshCount = shard.refreshStats().getExternalTotal();

final long externalRefreshCount = shard.refreshStats().getExternalTotal();
final long extraInternalRefreshes = shard.routingEntry().primary() || shard.indexSettings().isSoftDeleteEnabled() == false ? 0 : 1;
Contributor Author

I'm not sure if this is expected or not. Seems a bit awkward. Maybe there's a better solution?

@dnhatn dnhatn left a comment

I did an initial pass and left some comments. This looks good.

@@ -255,6 +291,11 @@ public void recoverToTarget(ActionListener<RecoveryResponse> listener) {
}, onFailure);

establishRetentionLeaseStep.whenComplete(r -> {
if (useRetentionLeases) {
Member

Maybe just always close it once at the end, as it's a no-op here.

Contributor Author

It's not a no-op if we are doing a file-based recovery. In that case we have a proper retention lock at this point and should close it now so that we can start discarding history during phase 2. I clarified the condition to useRetentionLeases && isSequenceNumberBasedRecovery == false in 190649d.
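
For clarity, here is a hedged sketch of that early release, with a java.io.Closeable standing in for the retention lock; this is a simplification, not the actual 190649d change.

// Sketch only: for a file-based recovery we hold a real retention lock at this point,
// so release it now and let the primary resume discarding history during phase 2;
// for a sequence-number-based recovery the retention lease already retains the history.
static void maybeReleaseRetentionLockEarly(boolean useRetentionLeases,
                                           boolean isSequenceNumberBasedRecovery,
                                           java.io.Closeable retentionLock) throws java.io.IOException {
    if (useRetentionLeases && isSequenceNumberBasedRecovery == false) {
        retentionLock.close();
    }
}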


// temporarily prevent any history from being discarded, and do this before acquiring the safe commit so that we can
// be certain that all operations after the safe commit's local checkpoint will be retained for the duration of this
// recovery.
retentionLock = shard.acquireRetentionLock();
Member

We need to acquire the retention lock before calling hasCompleteHistoryOperations.

@DaveCTurner DaveCTurner Jul 27, 2019

Oof good catch, thanks. Addressed in 190649d.
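
A minimal sketch of the corrected ordering, assembled only from the lines quoted in this review (the declaration of retentionLock is elided, and this is not the actual 190649d diff): take the retention lock first, then check history completeness.

// Acquire the retention lock before deciding on an operations-based recovery, so that no
// history can be discarded between the completeness check and the start of phase 2.
retentionLock = shard.acquireRetentionLock();
final boolean isSequenceNumberBasedRecovery
    = request.startingSeqNo() != SequenceNumbers.UNASSIGNED_SEQ_NO
    && isTargetSameHistory()
    && shard.hasCompleteHistoryOperations("peer-recovery", request.startingSeqNo());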

final boolean isSequenceNumberBasedRecovery
= request.startingSeqNo() != SequenceNumbers.UNASSIGNED_SEQ_NO
&& isTargetSameHistory()
&& shard.hasCompleteHistoryOperations("peer-recovery", request.startingSeqNo())
Member

And if we rely on "peer recovery leases" to retain the history (when soft-deletes enabled), we might not need to check hasCompleteHistoryOperations.

@@ -158,13 +164,32 @@ public void recoverToTarget(ActionListener<RecoveryResponse> listener) {
throw new DelayRecoveryException("source node does not have the shard listed in its state as allocated on the node");
}
assert targetShardRouting.initializing() : "expected recovery target to be initializing but was " + targetShardRouting;
retentionLeaseRef.set(useRetentionLeases ? shard.getRetentionLeases().get(
@dnhatn dnhatn Jul 26, 2019

Is there any issue if we use any existing retention lease for peer recovery purposes? I mean, relying on hasCompleteHistoryOperations.

Contributor Author

Yes, the issue is that we do not guarantee that the primary retains every operation required by every retention lease, so we must use hasCompleteHistoryOperations to ensure this. The problem is that in a file-based recovery we create a retention lease at the local checkpoint of this shard's safe commit, which may be behind every other lease, so we cannot be certain that every other peer is also able to respect this lease; if this primary were to fail then another primary may be elected without all the history needed for all its leases.

Contributor Author

On reflection, it might actually be possible to do this: we'd need to keep the retention lock open until at least having replayed history to the global checkpoint. I'll think about this a bit more.

Contributor Author

Ok, Yannick and I discussed this and you're right, there's no real drawback to creating the new replica's retention lease according to the current (persisted) global checkpoint rather than the local checkpoint of the safe commit. This means we have a much better chance that this retention lease is satisfied on every in-sync shard copy, at which point in many cases there is no obvious need to call hasCompleteHistoryOperations.

It is, however, not totally watertight, because today we sometimes create unsatisfied leases for BWC reasons. I haven't gone through the details to see whether there are other such cases too, but given that it's not something we assert today I'm concerned there might be. Also, calling hasCompleteHistoryOperations is cheap and will save us from disaster, so let's keep it in.

I've adjusted the creation logic in 5fb8bda.
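
As a rough illustration of the adjusted creation logic (the method name is a hypothetical stand-in, not the 5fb8bda code), under the assumption that a lease retains all operations at or above its retaining sequence number:

// Hypothetical sketch: base the new replica's retention lease on the persisted global
// checkpoint rather than on the local checkpoint of the safe commit, so the lease is far
// more likely to be satisfiable on every in-sync copy.
static long retainingSeqNoForNewReplicaLease(long persistedGlobalCheckpoint) {
    return persistedGlobalCheckpoint + 1; // retain everything above the global checkpoint
}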

@DaveCTurner DaveCTurner left a comment

Thanks @dnhatn, comments addressed or responded to.

@DaveCTurner DaveCTurner requested a review from dnhatn July 30, 2019 09:15
@ywelsch ywelsch left a comment

I've left one question. Looking good otherwise.

dnhatn previously approved these changes Jul 30, 2019
@dnhatn dnhatn left a comment

LGTM. Thanks @DaveCTurner.

ywelsch previously approved these changes Jul 30, 2019
@ywelsch ywelsch left a comment

LGTM

@DaveCTurner
Contributor Author

The test failure is meaningful: https://scans.gradle.com/s/bcfrvej2xrypy/console-log?task=:x-pack:plugin:ccr:internalClusterTest

What happened is that there was ongoing indexing while a replica was recovering, which plays out like this:

  • we capture the GCP of the primary and use that in the cleanFiles() step when creating the new translog.

  • the GCP of the primary advances, as does its retention lease.

  • we clone the primary's (newly-advanced) retention lease for the replica.

  • the replica's GCP does not advance.

The replica's retention lease is now ahead of its GCP, and this trips an assertion at renewal time. Yannick and I discussed a couple of options:

  • weaken the assertion to ignore cases where the shard copy is not in sync. This makes it quite a bit weaker, but I can't see any terrible consequences of this.

  • create the retention lease slightly earlier, within phase 1, and base the GCP of the replica's new translog on the retained seqno of this new retention lease.

@dnhatn WDYT?
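
For reference, a sketch of the shape of the invariant that trips here; the real assertion lives elsewhere and is phrased differently, so treat these names as hypothetical.

// Sketch of the invariant only: a copy's peer recovery retention lease should not retain
// from beyond what that copy has persisted, so cloning the primary's freshly-advanced
// lease while the replica's GCP lags behind violates this at renewal time.
static boolean leaseConsistentWithCheckpoint(long leaseRetainingSeqNo, long persistedGlobalCheckpoint) {
    return leaseRetainingSeqNo <= persistedGlobalCheckpoint + 1;
}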

@DaveCTurner DaveCTurner dismissed stale reviews from dnhatn and ywelsch July 30, 2019 16:32

test failure will require some rework

@DaveCTurner DaveCTurner requested a review from dnhatn July 30, 2019 16:33
@dnhatn
Member

dnhatn commented Jul 30, 2019

@DaveCTurner Thanks for the ping. Both options work for me.

@DaveCTurner
Contributor Author

@elasticmachine please run elasticsearch-ci/docs (failure looks unrelated and bogus)

@DaveCTurner
Contributor Author

@elasticmachine please run elasticsearch-ci/1


@DaveCTurner DaveCTurner removed the WIP label Aug 1, 2019
@DaveCTurner
Contributor Author

Ok I'm happy with this and it's passed a bunch of runs through :server:test and :server:integtest overnight. Worth a final pass.

Changes since the last reviews:

  • d63e777 to go back to today's behaviour of using the GCP of the primary as the starting GCP of the replica, but adjusting this to be sampled after copying the files over and cloning the primary's lease so we can be sure that it's ahead of the leased checkpoint.

  • f47e56e to also create the lease when phase 1 is a no-op thanks to a synced flush marker (caught by BWC tests, but I added a proper test for it too).

@DaveCTurner DaveCTurner requested a review from ywelsch August 1, 2019 08:20
@ywelsch ywelsch left a comment

LGTM

@DaveCTurner DaveCTurner merged commit 5322b00 into elastic:peer-recovery-retention-leases Aug 1, 2019
@DaveCTurner DaveCTurner deleted the 2019-07-25-prrl-integrate-recovery branch August 1, 2019 12:19
DaveCTurner added a commit that referenced this pull request Aug 1, 2019