Primary send safe commit in file-based recovery #28038

dnhatn · 2017-12-31T15:42:12Z

Today a primary shard transfers the most recent commit point to a replica shard in a file-based recovery. However, the most recent commit may not be a "safe" commit; as a result, the replica shard has no safe commit point until it can retain one by itself.

This commit collapses the snapshot deletion policy into the combined deletion policy and modifies the peer recovery source to send a safe commit.

Relates #10708

bleskes

I like it. I made a quick pass with initial comments.

bleskes · 2018-01-09T20:33:59Z

core/src/main/java/org/elasticsearch/index/engine/CombinedDeletionPolicy.java

+        assert lastCommit != null : "Last commit is not initialized yet";
+        final IndexCommit snapshotting = acquiringSafeCommit ? safeCommit : lastCommit;
+        snapshottedCommits.addTo(snapshotting, 1); // increase refCount
+        return new Engine.IndexCommitRef(snapshotting, () -> releaseCommit(snapshotting));


can we leave this to the engine and make the releaseCommit method public?

we can also return a commit wrapper that disables the delete method?
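A minimal sketch of the wrapper idea raised above, assuming a reduced stand-in for Lucene's `IndexCommit`. `SketchCommit`, `PlainCommit` and `NonDeletableCommit` are hypothetical names for illustration only and are not the classes in this PR; the only behavior taken from the thread is that the wrapper delegates to the original commit and refuses `delete()` with an `UnsupportedOperationException`.

```java
// Hypothetical stand-in for Lucene's IndexCommit, reduced to two methods for this sketch.
abstract class SketchCommit {
    abstract long getGeneration();
    abstract void delete();
}

// A plain commit that records whether delete() was called.
final class PlainCommit extends SketchCommit {
    private final long generation;
    boolean deleted = false;

    PlainCommit(long generation) {
        this.generation = generation;
    }

    @Override
    long getGeneration() {
        return generation;
    }

    @Override
    void delete() {
        deleted = true;
    }
}

// Wrapper handed out to callers: reads are delegated, deletion is refused.
final class NonDeletableCommit extends SketchCommit {
    final SketchCommit delegate;

    NonDeletableCommit(SketchCommit delegate) {
        this.delegate = delegate;
    }

    @Override
    long getGeneration() {
        return delegate.getGeneration();
    }

    @Override
    void delete() {
        throw new UnsupportedOperationException("a snapshotted commit can not be deleted by its holder");
    }
}
```

With this shape the deletion policy keeps the only reference that can actually delete the commit, while callers receive a read-only view.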

bleskes · 2018-01-09T20:40:16Z

core/src/main/java/org/elasticsearch/index/engine/CombinedDeletionPolicy.java

+        assert refCount >= 0 : "Number of snapshots can not be negative [" + refCount + "]";
+        if (refCount == 0) {
+            snapshottedCommits.remove(releasingCommit);
+            updateTranslogDeletionPolicy();


why does the translog care about the snapshotting situation? it only cares about the safe commit, no?

Assume that we have two commits, c1 and c2, where c1 is the safe commit and c2 is the last commit. Clients acquire c1, so we have to keep the translog of c1 until they release the commit. If a new commit arrives (or the global checkpoint advances) during that time, c1 and its translog become eligible for release, but we have to keep them because they are being snapshotted. When clients release c1, we should also release its translog rather than wait until the next onCommit.

I think we should not release translog here. We should either keep or release both the commit and translog at the same time. I will update this.
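The reference-counting behavior discussed in this thread can be sketched as follows. This is an illustration under stated assumptions, not the PR's `CombinedDeletionPolicy`: commits are represented by plain strings, `RefCountingPolicySketch` and its methods are hypothetical, and translog retention is left out entirely. It only shows that a commit with outstanding references survives `onCommit` and becomes deletable once released.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: commits are plain strings, translog handling is omitted.
final class RefCountingPolicySketch {
    private final Map<String, Integer> snapshotted = new HashMap<>();
    private final List<String> deleted = new ArrayList<>();

    // Acquiring a commit only increases its reference count.
    synchronized String acquire(String commit) {
        snapshotted.merge(commit, 1, Integer::sum);
        return commit;
    }

    // Releasing decreases the count and forgets the commit when it reaches zero.
    synchronized void release(String commit) {
        Integer refCount = snapshotted.get(commit);
        if (refCount == null) {
            throw new IllegalArgumentException("release of non-snapshotted commit [" + commit + "]");
        }
        if (refCount == 1) {
            snapshotted.remove(commit);
        } else {
            snapshotted.put(commit, refCount - 1);
        }
    }

    // Commits before keptPosition are deleted unless someone still holds them.
    synchronized void onCommit(List<String> commits, int keptPosition) {
        for (int i = 0; i < keptPosition; i++) {
            String commit = commits.get(i);
            if (!snapshotted.containsKey(commit) && !deleted.contains(commit)) {
                deleted.add(commit);
            }
        }
    }

    synchronized List<String> deletedCommits() {
        return new ArrayList<>(deleted);
    }
}
```

In this sketch a released commit is not deleted immediately; it is picked up by the next `onCommit` call.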

bleskes · 2018-01-09T20:42:21Z

core/src/main/java/org/elasticsearch/index/engine/Engine.java

     * @param flushFirst indicates whether the engine should flush before returning the snapshot
     */
-    public abstract IndexCommitRef acquireIndexCommit(boolean flushFirst) throws EngineException;
+    public abstract IndexCommitRef acquireIndexCommit(boolean safeCommit, boolean flushFirst) throws EngineException;


we should look in future to remove one of these booleans - either you want to "get everything" or you want a safe commit. I don't think there's a point in "flush but give me a safe commit"

I agree. Are you ok if we replace this by having two methods: acquireLastIndexCommit(flushFirst) and acquireSafeIndexCommit(no option)?

+1

double checking - the new methods will be a follow up?
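To make the proposed split concrete, here is a small sketch of the two-method shape. Only the method names `acquireLastIndexCommit` and `acquireSafeIndexCommit` come from the comment above; the `EngineSketch` class, the string-named commits and the fake `flush()` are assumptions for illustration.

```java
// Hypothetical sketch of the two-method API; commits are plain strings.
final class EngineSketch {
    private String lastCommit = "c1";
    private final String safeCommit = "c1";
    private int flushes = 0;

    // Callers that want "everything" may ask for a flush first.
    String acquireLastIndexCommit(boolean flushFirst) {
        if (flushFirst) {
            flush();
        }
        return lastCommit;
    }

    // Callers that want a safe commit have no flush option to misuse.
    String acquireSafeIndexCommit() {
        return safeCommit;
    }

    // Fake flush: each call produces a new last commit named c2, c3, ...
    private void flush() {
        flushes++;
        lastCommit = "c" + (flushes + 1);
    }
}
```

The point of the split is that the combination "flush first, but give me a safe commit" can no longer be expressed by a caller.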

dnhatn · 2018-01-10T01:48:48Z

@bleskes, I've updated the deletion policy to return a non-deletable commit and wrap it into a commit-ref in the engine. I am not happy with the casting line in the releaseCommit method [final IndexCommit releasingCommit = ((SnapshotIndexCommit) snapshotCommit).delegate;]. I will reach out to discuss this with you.

bleskes

looking good. I left some more minor feedback

bleskes · 2018-01-10T09:40:58Z

core/src/main/java/org/elasticsearch/index/engine/CombinedDeletionPolicy.java

+     * Releases an index commit that acquired by {@link #acquireIndexCommit(boolean)}.
+     */
+    synchronized void releaseCommit(final IndexCommit releasingCommit) {
+        assert snapshottedCommits.containsKey(releasingCommit) : "Release non-snapshotted commit;" +


how does that work? releasingCommit is a SnapshotIndexCommit ?

The snapshotting commits are stored as keys in a HashMap. Both SnapshotIndexCommit and regular index commit inherit equals and hashCode from the root IndexCommit, thus they are interchangeable. This can be problematic if a regular index commit overrides equals or hashCode.

I see some options to avoid this.

1. Expose SnapshotIndexCommit at package level and make acquireCommit and releaseCommit use the SnapshotIndexCommit type.
2. Delegate hashCode and equals of SnapshotIndexCommit to the original index commit.

WDYT?

I see. I think it's risky as the underlying IndexCommit may have a different implementation (it's not final). I see a 3rd option - we could use an identity map and make sure people only release index commits they got from us. Until we need the ability to work with all kinds of craziness like wrapped IndexCommits, I prefer to keep things strict.

Lucene's SnapshotDeletionPolicy identifies IndexCommits based on their generation (long field).

We discussed and agreed to keep the current implementation.
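For readers weighing the identity-map option mentioned in this thread (which was not adopted, since the current implementation was kept), the following sketch shows what it would change. `GenCommit` and `IdentityTracker` are hypothetical names; `GenCommit` deliberately defines `equals` and `hashCode` by generation so that two distinct objects compare equal.

```java
import java.util.IdentityHashMap;
import java.util.Map;

// Hypothetical commit whose equality is based on its generation.
final class GenCommit {
    final long generation;

    GenCommit(long generation) {
        this.generation = generation;
    }

    @Override
    public boolean equals(Object other) {
        return other instanceof GenCommit && ((GenCommit) other).generation == generation;
    }

    @Override
    public int hashCode() {
        return Long.hashCode(generation);
    }
}

// Tracks reference counts by object identity rather than by equals/hashCode.
final class IdentityTracker {
    private final Map<GenCommit, Integer> refCounts = new IdentityHashMap<>();

    GenCommit acquire(GenCommit commit) {
        refCounts.merge(commit, 1, Integer::sum);
        return commit;
    }

    // Returns false when the exact object was never handed out by acquire.
    boolean release(GenCommit commit) {
        Integer count = refCounts.get(commit);
        if (count == null) {
            return false;
        }
        if (count == 1) {
            refCounts.remove(commit);
        } else {
            refCounts.put(commit, count - 1);
        }
        return true;
    }
}
```

With an identity map, an object that merely equals a tracked commit is rejected on release, which is the strictness argued for above; a regular `HashMap` would accept it.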

bleskes · 2018-01-10T09:44:33Z

core/src/main/java/org/elasticsearch/index/engine/InternalEngine.java

-            logger.trace("pulling snapshot");
-            return new IndexCommitRef(snapshotDeletionPolicy);
-        } catch (IOException e) {
-            throw new SnapshotFailedEngineException(shardId, e);


This means a potential change to the exception type. Can you double check it's OK?

This should be ok. The method snapshot of Lucene's SnapshotDeletionPolicy throws IOException but we don't. Acquiring a commit is just increasing refCount of a commit.

bleskes · 2018-01-10T09:46:36Z

core/src/test/java/org/elasticsearch/index/engine/CombinedDeletionPolicyTests.java

+        final IndexCommit ref2 = indexPolicy.acquireIndexCommit(false);
+        assertThat(ref2, equalTo(c2));
+        expectThrows(UnsupportedOperationException.class, ref2::delete);
+        assertThat(translogPolicy.getMinTranslogGenerationForRecovery(), lessThanOrEqualTo(100L));


shouldn't this be exactly translogGen1?

lol. I fixed

bleskes · 2018-01-10T09:48:10Z

core/src/test/java/org/elasticsearch/index/engine/CombinedDeletionPolicyTests.java

+        assertThat(ref3, equalTo(c2));
+        assertThat(translogPolicy.getMinTranslogGenerationForRecovery(), equalTo(translogGen1));
+        indexPolicy.releaseCommit(ref1); // release acquired commit releases translog and commit
+        indexPolicy.onCommit(Arrays.asList(c1, c2)); // Flush new commit deletes c1


can we also release c2 and see that it is not deleted? I would also appreciate some randomness here - i.e., a series of commits, choose snapshot at random times. Release at random order and check that all is OK.

bleskes · 2018-01-10T15:10:30Z

Thx.

ywelsch

I've left two suggestions

ywelsch · 2018-01-10T17:03:03Z

core/src/main/java/org/elasticsearch/index/engine/CombinedDeletionPolicy.java

@@ -42,12 +46,16 @@
    private final TranslogDeletionPolicy translogDeletionPolicy;
    private final EngineConfig.OpenMode openMode;
    private final LongSupplier globalCheckpointSupplier;
+    private final ObjectIntHashMap<IndexCommit> snapshottedCommits; // Number of snapshots held against each commit point.
+    private IndexCommit safeCommit; // the most recent safe commit point - its max_seqno at most the persisted global checkpoint.


The safe commit point is a constantly moving target (as the global checkpoint keeps going up and new commits are being added). I wonder if it's nicer to calculate the safe commit point when it's accessed in acquireIndexCommit, based on the then-current global checkpoint and the current list of commits (this will require storing the last seen indexCommits, but that would be equivalent to what's being done in Lucene's SnapshotDeletionPolicy).

@ywelsch Nhat has a follow up to trim unneeded commits as soon as the global checkpoint advances enough. This will have the side effect you mention (it will update things) with some added value.

yes, but that involves some extra machinery to update safeCommit at the right points in time (e.g. when gcp advances)? My suggestion here makes that unnecessary?

Ok, I think the added value of the approach in the follow-up is that the clean-up logic is also possibly more eagerly invoked (when gcp advances).
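The "compute on access" suggestion can be sketched as a pure function of the stored commit list and the current global checkpoint. This is an illustration only: `SafeCommitFinder` is a hypothetical name, each commit is reduced to its max_seqno, and falling back to the oldest commit when no commit qualifies is an assumption of this sketch. The definition used is the one from the diff comment: the most recent commit whose max_seqno is at most the global checkpoint.

```java
// Hypothetical helper: each commit is represented only by its max_seqno, oldest first.
final class SafeCommitFinder {
    private SafeCommitFinder() {
    }

    // Returns the index of the most recent commit whose max_seqno <= globalCheckpoint.
    static int indexOfSafeCommit(long[] maxSeqNos, long globalCheckpoint) {
        if (maxSeqNos.length == 0) {
            throw new IllegalArgumentException("commit list must not be empty");
        }
        for (int i = maxSeqNos.length - 1; i >= 0; i--) {
            if (maxSeqNos[i] <= globalCheckpoint) {
                return i;
            }
        }
        // Assumption of this sketch: with no qualifying commit, fall back to the oldest one.
        return 0;
    }
}
```

Because the result depends only on its two inputs, calling it at acquire time always reflects the latest global checkpoint without any bookkeeping when the checkpoint advances; the trade-off raised in the thread is that clean-up is then not triggered eagerly.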

ywelsch · 2018-01-10T17:05:58Z

core/src/main/java/org/elasticsearch/index/engine/CombinedDeletionPolicy.java

+     * Releases an index commit that acquired by {@link #acquireIndexCommit(boolean)}.
+     */
+    synchronized void releaseCommit(final IndexCommit releasingCommit) {
+        assert snapshottedCommits.containsKey(releasingCommit) : "Release non-snapshotted commit;" +


Lucene's SnapshotDeletionPolicy identifies IndexCommits based on their generation (long field).

dnhatn · 2018-01-10T18:54:43Z

@bleskes

I've added a random test and updated the index policy not to retain the translog of the snapshotting commits.
I realized that a delegating equals does not fully conform to the equals contract. I decided to cast a releasing commit to the inner class. @ywelsch pointed out that Lucene's snapshot policy uses a commit generation as a key (we are using generation and directory). I am fine with the suggestion, WDYT?

bleskes

LGTM. Thanks @dnhatn

bleskes · 2018-01-10T23:05:09Z

core/src/main/java/org/elasticsearch/index/engine/CombinedDeletionPolicy.java

        final int keptPosition = indexOfKeptCommits(commits, globalCheckpointSupplier.getAsLong());
+        lastCommit = commits.get(commits.size() - 1);


can we assert that the translog gen in this commit is lower than all the ones in higher commits?

bleskes · 2018-01-10T23:06:00Z

core/src/main/java/org/elasticsearch/index/engine/Engine.java

     * @param flushFirst indicates whether the engine should flush before returning the snapshot
     */
-    public abstract IndexCommitRef acquireIndexCommit(boolean flushFirst) throws EngineException;
+    public abstract IndexCommitRef acquireIndexCommit(boolean safeCommit, boolean flushFirst) throws EngineException;


double checking - the new methods will be a follow up?

dnhatn · 2018-01-11T15:38:08Z

Thanks @bleskes and @ywelsch for reviewing.

Currently we keep a 5.x index commit as a safe commit until we have a 6.x safe commit. During that time, if a peer recovery happens, the primary will send a 5.x commit in a file-based sync and the recovery will fail because the snapshotted commit does not have sequence number tags. This commit updates the combined deletion policy to delete legacy commits if there are 6.x commits. Relates elastic#27606 Relates elastic#28038

If a 6.x node with a 5.x index is promoted to be a primary, it will flush a new index commit to make sure translog operations without seqno will never be replayed (see IndexShard#updateShardState). However, the global checkpoint is still UNASSIGNED and the max_seqno of both commits is NO_OPS_PERFORMED. If the combined deletion policy considers the first commit as a safe commit, we will send the first commit to the replica in a peer recovery without replaying the translog between these commits. This causes the replica to miss those operations. To prevent this, we should not keep more than one commit whose max_seqno is NO_OPS_PERFORMED. Once we can retain a safe commit, a NO_OPS_PERFORMED commit will be deleted just like other commits. Relates #28038

The global checkpoint should be initialized to unassigned rather than 0. If a single document is indexed and the global checkpoint is initialized with 0, the first commit is safe, which the test does not expect. Relates #28038

Previously we introduced a new parameter to `acquireIndexCommit` to allow acquiring either a safe commit or the last commit. However, with the new parameter, callers can provide a nonsense combination: flush first but acquire the safe commit. This commit separates the acquireIndexCommit method into two different methods to avoid that problem. Moreover, this change should also improve readability. Follow-up elastic#28038

- `NodeShouldNotConnectException` has not been instantiated since 5.0
- `GatewayException` has not been instantiated since 5.0
- `SnapshotFailedEngineException` has not been instantiated since 6.2.0 (elastic#28038) and was never thrown across clusters

This commit removes these obsolete exceptions.

dnhatn added the :Distributed Indexing/Recovery, >enhancement, review, v6.2.0 and v7.0.0 labels Dec 31, 2017

dnhatn requested review from bleskes, ywelsch and jasontedor December 31, 2017 15:42

dnhatn changed the title ~~Primary sends a safe commit in file-based recovery~~ Primary send safe commit in file-based recovery Dec 31, 2017

dnhatn added 2 commits December 31, 2017 11:13

assert lock held when updating translog gens

00ea4f3

Merge branch 'master' into recovery/primary-send-safe-commit

7f4d848

# Conflicts: # test/framework/src/main/java/org/elasticsearch/index/shard/IndexShardTestCase.java

bleskes suggested changes Jan 9, 2018

Return non-deletable commit

cce0742

Avoid casting - manually mock index commit

0285001

bleskes suggested changes Jan 10, 2018

Revert "Avoid casting - manually mock index commit"

5b47f12

This reverts commit 0285001.

ywelsch reviewed Jan 10, 2018

dnhatn added 2 commits January 10, 2018 13:45

Add a random test

08d1612

Merge branch 'master' into recovery/primary-send-safe-commit

c8a9b2d

bleskes approved these changes Jan 10, 2018

dnhatn merged commit 626c3d1 into elastic:master Jan 11, 2018

dnhatn deleted the recovery/primary-send-safe-commit branch January 11, 2018 15:39

dnhatn added the backport pending label Jan 11, 2018

dnhatn mentioned this pull request Jan 11, 2018

Do not keep 5.x commits when having 6.x commits #28188

Merged

dnhatn removed the backport pending label Jan 12, 2018

bleskes mentioned this pull request Jan 12, 2018

Add Sequence Numbers to write operations #10708

Closed

64 tasks

dnhatn mentioned this pull request Jan 17, 2018

Separate acquiring safe commit and last commit #28271

Merged

jimczi added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019

DaveCTurner mentioned this pull request Mar 1, 2021

Remove some obsolete exceptions #69675

Merged
