Reset replica engine to global checkpoint on promotion #33473
Conversation
When a replica starts following a newly promoted primary, it may have some operations which don't exist on the new primary. We need to discard those operations to align the replica with the new primary. This can be done by resetting the engine from the safe commit, then replaying the local translog up to the global checkpoint.
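Below is a minimal sketch of that flow, pieced together from the snippets discussed later in this review. The method names and the idea of bounding translog replay by the global checkpoint follow the description above, but the exact signatures (for example, recoverFromTranslog taking an upper sequence number) are illustrative rather than the final API.

```java
// Illustrative sketch only: roll the engine back to the safe commit and replay the
// local translog up to the global checkpoint. Names mirror the review snippets below.
void rollbackEngineToGlobalCheckpoint() throws IOException {
    verifyNotClosed();
    // Close the current engine so no further operations touch the old Lucene index.
    IOUtils.close(currentEngineReference.getAndSet(null));
    // Drop commit points that are not safe so the new engine opens from the safe commit.
    trimUnsafeCommits();
    // Open a fresh engine on top of the safe commit.
    final Engine newEngine = createNewEngine(newEngineConfig());
    // Replay the local translog only up to the global checkpoint; operations above it
    // may not exist on the new primary and are discarded by the rollback.
    final long globalCheckpoint = getGlobalCheckpoint();
    newEngine.recoverFromTranslog(globalCheckpoint); // illustrative signature
}
```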
Pinging @elastic/es-distributed
This is the first part of #32867 (comment).
left some comments. This looks much simpler than the overall one.
public boolean isRecovery() {
    return this == PEER_RECOVERY || this == LOCAL_TRANSLOG_RECOVERY;
}
boolean isLocal() {
Maybe we call it isRemote() and then we don't need to invert the if statements in the Engine?
@@ -1163,11 +1157,16 @@ public Operation(Term uid, long seqNo, long primaryTerm, long version, VersionTy
    PRIMARY,
    REPLICA,
    PEER_RECOVERY,
    LOCAL_TRANSLOG_RECOVERY;
    LOCAL_TRANSLOG_RECOVERY,
    LOCAL_RESETTING;
Should we call this LOCAL_RESET? Then it's consistent with RECOVERY, which is not a continuous form.
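A minimal sketch of how the two naming suggestions in this review (isRemote() above and LOCAL_RESET here) might fit together; this is illustrative only, not the final shape of the enum:

```java
public enum Origin {
    PRIMARY,
    REPLICA,
    PEER_RECOVERY,
    LOCAL_TRANSLOG_RECOVERY,
    LOCAL_RESET; // LOCAL_RESET rather than LOCAL_RESETTING, consistent with RECOVERY

    public boolean isRecovery() {
        return this == PEER_RECOVERY || this == LOCAL_TRANSLOG_RECOVERY;
    }

    // isRemote() instead of isLocal(): call sites can then use origin.isRemote() directly
    // instead of inverting an isLocal() check.
    public boolean isRemote() {
        return this == PRIMARY || this == REPLICA || this == PEER_RECOVERY;
    }
}
```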
}

private int runTranslogRecovery(Engine engine, Translog.Snapshot snapshot, Engine.Operation.Origin origin,
                                Runnable onPerOperationRecovered) throws IOException {
Just call it onOperationRecovered?
    }
    shard.close("test", false);
} finally {
    IOUtils.close(shard.store());
Maybe put this into a try-with-resources block?
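One possible shape for that suggestion, assuming the test owns the store reference for the whole block and that closing it here is equivalent to the IOUtils.close call above (illustrative only):

```java
// Store is Closeable (java.io.Closeable), so the finally-based cleanup could become:
try (Closeable ignored = shard.store()) {
    // ... exercise and assert on the shard ...
    shard.close("test", false);
}
```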
        translogRecoveryStats::incrementRecoveredOperations);
}

private int runTranslogRecoveryAfterResetting(Engine engine, Translog.Snapshot snapshot) throws IOException {
Can we add a javadoc comment on what this does vs. the ordinary recovery? I also wonder if we should maybe only have this version, int runTranslogRecovery(Engine engine, Translog.Snapshot snapshot, Engine.Operation.Origin origin, Runnable onPerOperationRecovered) throws IOException, and pass in closures where we actually call it. Since we really only call it from a single place, it would make this class a little less complex.
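A sketch of that consolidation, reusing the names that appear in the diff above; applyTranslogOperation is a hypothetical stand-in for whatever the call site actually invokes per operation:

```java
// Single recovery method; the per-operation side effect is supplied by the caller.
private int runTranslogRecovery(Engine engine, Translog.Snapshot snapshot,
                                Engine.Operation.Origin origin,
                                Runnable onOperationRecovered) throws IOException {
    int opsRecovered = 0;
    Translog.Operation operation;
    while ((operation = snapshot.next()) != null) {
        applyTranslogOperation(engine, operation, origin); // hypothetical helper
        opsRecovered++;
        onOperationRecovered.run(); // e.g. update recovery stats, or a no-op when resetting
    }
    return opsRecovered;
}

// Ordinary recovery passes a closure that updates the stats:
//   runTranslogRecovery(engine, snapshot, Engine.Operation.Origin.LOCAL_TRANSLOG_RECOVERY,
//           translogRecoveryStats::incrementRecoveredOperations);
// The resetting path would pass its own closure instead of having a separate method.
```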
Thanks @dnhatn. I left a bunch of comments. Looking good.
@@ -833,7 +834,7 @@ public IndexResult index(Index index) throws IOException {
    indexResult = new IndexResult(
        plan.versionForIndexing, getPrimaryTerm(), plan.seqNoForIndexing, plan.currentNotFoundOrDeleted);
}
if (index.origin() != Operation.Origin.LOCAL_TRANSLOG_RECOVERY) {
if (index.origin().isRemote()) {
Maybe rename this to isTranslog? Then it will tie directly to what's happening in this code.
@@ -109,6 +109,7 @@ public synchronized void markSeqNoAsCompleted(final long seqNo) {
 * @param checkpoint the local checkpoint to reset this tracker to
 */
public synchronized void resetCheckpoint(final long checkpoint) {
    // TODO: remove this method as we no longer need it.
what are we waiting on?
We have tests which verify that we restore the local checkpoint after resetting it to the global checkpoint. I decided to leave this method in this PR to minimize the changes; I will remove it in the next PR.
int runTranslogRecovery(Engine engine, Translog.Snapshot snapshot) throws IOException {
    recoveryState.getTranslog().totalOperations(snapshot.totalOperations());
    recoveryState.getTranslog().totalOperationsOnStart(snapshot.totalOperations());
int runTranslogRecovery(Engine engine, Translog.Snapshot snapshot, Engine.Operation.Origin origin,
Can we add some javadocs on what onOperationRecovered means?
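Something along these lines would answer the question; the wording is a hedged illustration, not the javadoc that was actually committed:

```java
/**
 * Replays the operations of the given translog snapshot into the engine.
 *
 * @param engine               the engine to apply the translog operations to
 * @param snapshot             the translog snapshot to replay
 * @param origin               the origin the replayed operations are tagged with
 * @param onOperationRecovered callback invoked after each operation has been applied,
 *                             e.g. to update recovery statistics
 * @return the number of operations that were replayed
 */
```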
final String translogUUID = store.readLastCommittedSegmentsInfo().getUserData().get(Translog.TRANSLOG_UUID_KEY);
final long globalCheckpoint = Translog.readGlobalCheckpoint(translogConfig.getTranslogPath(), translogUUID);
final long minRetainedTranslogGen = Translog.readMinTranslogGeneration(translogConfig.getTranslogPath(), translogUUID);
store.trimUnsafeCommits(globalCheckpoint, minRetainedTranslogGen, config.getIndexSettings().getIndexVersionCreated());
It feels weird to do these things here - this method now only creates an engine and doesn't change the IndexShard fields - imo it shouldn't touch the store (because it doesn't know anything about any currently running engine).
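A hedged sketch of the separation being suggested, reusing names from the snippets in this review: the helper only builds and wires up the engine, and the caller prepares the store beforehand.

```java
// Helper only creates the engine; it never touches the store or other IndexShard state.
private Engine createNewEngine(EngineConfig config) {
    final Engine engine = engineFactory.newReadWriteEngine(config);
    onNewEngine(engine);
    engine.onSettingsChanged();
    return engine;
}

// Caller (the promotion-time reset path) does the store work first:
//   store.trimUnsafeCommits(globalCheckpoint, minRetainedTranslogGen,
//           config.getIndexSettings().getIndexVersionCreated());
//   final Engine newEngine = createNewEngine(newEngineConfig());
```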
final Engine engine = engineFactory.newReadWriteEngine(config);
onNewEngine(engine);
engine.onSettingsChanged();
active.set(true);
Same comment - it's weird that this changes the IndexShard active state without actually exposing the engine.
// - replica1 has {doc1}
// - replica2 has {doc1, doc2}
// - replica3 can have either {doc2} (if operation-based recovery) or {doc1, doc2} (if file-based recovery)
shards.assertAllEqual(initDocs + 1);
w00t
final List<Translog.Operation> expectedOps = new ArrayList<>(initOperations);
expectedOps.add(op2);
assertThat(snapshot, containsOperationsInAnyOrder(expectedOps));
List<Translog.Operation> operations = TestTranslog.drainAll(snapshot);
Did we lose the check that initOperations are also part of the snapshot?
@@ -1879,13 +1873,16 @@ public void testRecoverFromStoreRemoveStaleOperations() throws Exception {
    SourceToParse.source(indexName, "_doc", "doc-1", new BytesArray("{}"), XContentType.JSON));
flushShard(shard);
assertThat(getShardDocUIDs(shard), containsInAnyOrder("doc-0", "doc-1"));
// Simulate resync (without rollback): Noop #1, index #2
acquireReplicaOperationPermitBlockingly(shard, shard.pendingPrimaryTerm + 1);
// Here we try to simulate the primary fail-over without rollback which is no longer the case.
I don't follow this comment. Can you clarify please?
    .indexServiceSafe(replicaShardRouting.index()).getShard(replicaShardRouting.id());
final Set<String> docsOnReplica;
try {
    docsOnReplica = IndexShardTestCase.getShardDocUIDs(replicaShard);
will it be a lot of work to check that the source, primary terms and seq# are also identical?
verifyNotClosed();
IOUtils.close(currentEngineReference.getAndSet(null));
trimUnsafeCommits();
newEngine = createNewEngine(newEngineConfig());
just wondering. Would it make sense to do the trimUnsafeCommits as part of the new engine creation?
It was before, but we prefer not to modify the Store implicitly (#33473 (comment)).
LGTM
If a shard is empty, it won't roll back its engine on promotion. This commit adjusts the expectation in the rollback test. Relates #33473
* master: (43 commits)
  [HLRC][ML] Add ML put datafeed API to HLRC (elastic#33603)
  Update AWS SDK to 1.11.406 in repository-s3 (elastic#30723)
  Expose CCR stats to monitoring (elastic#33617)
  [Docs] Update match-query.asciidoc (elastic#33610)
  TEST: Adjust rollback condition when shard is empty
  [CCR] Improve shard follow task's retryable error handling (elastic#33371)
  Forbid negative `weight` in Function Score Query (elastic#33390)
  Clarify context suggestions filtering and boosting (elastic#33601)
  Disable CCR REST endpoints if CCR disabled (elastic#33619)
  Lower version on full cluster restart settings test
  Upgrade remote cluster settings (elastic#33537)
  NETWORKING: http.publish_host Should Contain CNAME (elastic#32806)
  Add test coverage for global checkpoint listeners
  Reset replica engine to global checkpoint on promotion (elastic#33473)
  HLRC: ML Delete Forecast API (elastic#33526)
  Remove debug logging in full cluster restart tests (elastic#33612)
  Expose CCR to the transport client (elastic#33608)
  Mute testIndexDeletionWhenNodeRejoins
  SQL: Make Literal a NamedExpression (elastic#33583)
  [DOCS] Adds missing built-in user information (elastic#33585)
  ...
When a replica starts following a newly promoted primary, it may have some operations which don't exist on the new primary. We need to discard those operations to align the replica with the new primary. This can be done by first resetting the engine from the safe commit, then replaying the local translog up to the global checkpoint. Relates #32867
If a shard was serving as a replica when another shard was promoted to primary, then its Lucene index was reset to the global checkpoint. However, if the new primary fails before the primary/replica resync completes and we are now being promoted, we have to restore the reverted operations by replaying the translog to avoid losing acknowledged writes. Relates #33473 Relates #32867
Today we use the version of a DirectoryReader as a component of the key of IndicesRequestCache. This usage is perfectly fine since the version is advanced every time a new change is made into IndexWriter. In other words, two DirectoryReaders with the same version should have the same content. However, this invariant is only guaranteed in the context of a single IndexWriter because the version is reset to the committed version value when IndexWriter is re-opened. Since #33473, each IndexShard may have more than one IndexWriter, and using the version of a DirectoryReader as a part of the cache key can cause IndicesRequestCache to return stale cached values. For example, in #27650, we roll back the engine (i.e., re-open IndexWriter), index new documents, refresh, then make a count request, but the search layer mistakenly returns the count of the DirectoryReader of the previous IndexWriter because the current DirectoryReader has the same version as the old DirectoryReader even though their documents are different. This is possible because these two readers come from different IndexWriters. This commit replaces the version with the reader cache key of IndexReader as a component of the cache key of IndicesRequestCache. Closes #27650 Relates #33473
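A hedged illustration of the idea (simplified, not the actual IndicesRequestCache code, where reader stands for the DirectoryReader in question): key the cache on the reader's cache key rather than its version, because two readers from different IndexWriters can share a version but never share a cache key.

```java
// Lucene's cache helper gives every reader instance a distinct key object, so keying on
// it (instead of reader.getVersion()) cannot return entries computed against another reader.
IndexReader.CacheKey readerKey = reader.getReaderCacheHelper().getKey();
// ... use readerKey, rather than the reader version, as the reader component of the cache key
```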
With Lucene rollback (#33473), we should never have more than one primary term for each sequence number. Therefore we don't have to sort by the primary term when reading soft-deletes.