
Restore local history from translog on promotion #33616

Merged
merged 7 commits on Sep 20, 2018

Conversation

@dnhatn (Member) commented Sep 12, 2018

If a shard was serving as a replica when another shard was promoted to
primary, then its Lucene index was reset to the global checkpoint.
However, if the new primary fails before the primary/replica resync
completes and this shard is now being promoted, we have to restore the
reverted operations by replaying the translog to avoid losing acknowledged writes.

Relates #33473
Relates #32867
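
To make the mechanism concrete, here is a minimal sketch of the restore-on-promotion idea, using hypothetical Translog/Operation interfaces rather than the actual Elasticsearch classes: on promotion, the engine replays every translog operation above its local checkpoint back into the index.

```java
// Hypothetical sketch only; interface and method names are illustrative,
// not the real Elasticsearch API.
interface TranslogOperation {
    long seqNo();
}

interface TranslogSnapshot extends AutoCloseable {
    TranslogOperation next() throws Exception; // null when the snapshot is exhausted
}

final class PromotionRestoreSketch {
    /**
     * Replays all translog operations above the local checkpoint so that
     * operations reverted by an earlier rollback are restored.
     * Returns the number of operations replayed.
     */
    static int restoreLocalHistory(long localCheckpoint,
                                   TranslogSnapshot snapshot,
                                   java.util.function.Consumer<TranslogOperation> applyToEngine)
            throws Exception {
        int restored = 0;
        TranslogOperation op;
        while ((op = snapshot.next()) != null) {
            if (op.seqNo() <= localCheckpoint) {
                continue; // already reflected in the Lucene index
            }
            applyToEngine.accept(op); // re-index the reverted operation
            restored++;
        }
        return restored;
    }
}
```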

@dnhatn added the >enhancement, :Distributed Indexing/Recovery, v7.0.0, and v6.5.0 labels on Sep 12, 2018
@elasticmachine (Collaborator) commented:
Pinging @elastic/es-distributed

-            localCheckpointTracker.markSeqNoAsCompleted(operation.seqNo());
-        }
-    }
+    return translogRecoveryRunner.run(this, snapshot);
@dnhatn (Member, Author) commented Sep 12, 2018:

We could keep track of a max_seqno from the translog to recover up to when we roll back this engine (i.e., record recover_upto and the translog's max_seqno at that time), and then only restore if needed. However, I opted not to, for simplicity.
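
A hypothetical sketch of that skipped optimization (names here are illustrative, not from the PR): remember, at rollback time, the sequence number the index was reset to and the translog's max seq# at that moment, and skip the replay on promotion when nothing was actually trimmed.

```java
// Illustrative only; not code from this PR.
final class RollbackMarker {
    final long recoverUpTo;      // seq# the Lucene index was reset to (the global checkpoint)
    final long maxSeqNoAtReset;  // max seq# present in the translog at reset time

    RollbackMarker(long recoverUpTo, long maxSeqNoAtReset) {
        this.recoverUpTo = recoverUpTo;
        this.maxSeqNoAtReset = maxSeqNoAtReset;
    }

    /** The replay is only needed if the reset actually discarded operations. */
    boolean restoreNeeded() {
        return maxSeqNoAtReset > recoverUpTo;
    }
}
```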

@@ -341,7 +341,9 @@ public void rollTranslogGeneration() {
     }

     @Override
-    public void restoreLocalCheckpointFromTranslog() {
+    public int restoreLocalHistoryFromTranslog(TranslogRecoveryRunner translogRecoveryRunner) {
         assert false : "this should not be called";
Contributor comment:
I don't understand why this throws an exception. If you have an index that is read-only and uses this engine and a primary gets promoted, this should be a no-op, not a UOE (UnsupportedOperationException)?

@dnhatn (Member, Author) replied:
Yes, we should make this a no-op (just like fillSeqNoGaps).
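
A no-op override along these lines would satisfy that suggestion; this is a sketch of the shape being discussed, not necessarily the exact code that was merged:

```java
// Sketch of the no-op variant for a read-only engine; the method and
// parameter names follow the diff above, the body is illustrative.
@Override
public int restoreLocalHistoryFromTranslog(TranslogRecoveryRunner translogRecoveryRunner) {
    // A read-only engine has no reverted local history to restore, so doing
    // nothing on promotion is safe (mirroring how fillSeqNoGaps is a no-op).
    return 0;
}
```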

@dnhatn requested a review from s1monw on September 12, 2018 at 11:58
@dnhatn added the review label on Sep 12, 2018
@s1monw (Contributor) left a review comment:
LGTM but I think @bleskes needs to take a look too

dnhatn added a commit that referenced this pull request Sep 13, 2018
dnhatn added a commit that referenced this pull request Sep 13, 2018
@ywelsch (Contributor) left a review comment:
LGTM. Thanks @dnhatn

@dnhatn (Member, Author) commented Sep 20, 2018:

Thanks @s1monw and @ywelsch.

dnhatn added a commit that referenced this pull request Sep 20, 2018
kcm pushed a commit that referenced this pull request Oct 30, 2018
Labels
:Distributed Indexing/Recovery, >enhancement, v6.5.0, v7.0.0-beta1
5 participants