
Fix Two Races that Lead to Stuck Snapshots #37686

Merged

Conversation

original-brownbear
Member

@original-brownbear original-brownbear commented Jan 22, 2019

  • Fixes two broken spots:
    1. A master failover while deleting a snapshot that has no shards gets stuck if the new master finds the zero-shard snapshot in INIT state when processing the delete.
    2. Aborted shards that were never seen in INIT state by the SnapshotShardsService are never reported as failed, so the snapshot stays in ABORTED state with one or more shards stuck in ABORTED state and never gets deleted.
  • Kept the fixes as short as possible so we can backport to 6.x with the least amount of risk.
  • Significantly extended the test infrastructure to reproduce the two issues above.
    • Two new test runs:
      1. Reproducing the effects of node disconnects/restarts in isolation
      2. Reproducing the effects of disconnects/restarts in parallel with shard relocations and deletes
  • Relates to #32265 (Update IndexShardSnapshotStatus when an exception is encountered)
  • Closes #32348 (Snapshot stuck in half-deleted state)
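The second race can be pictured with a minimal, self-contained sketch (plain Java with hypothetical names; this is not the actual SnapshotShardsService code): when the master marks a shard ABORTED, the data node must report a failure even for shards it never saw in INIT, otherwise the master waits forever for a notification that will never come.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class AbortedShardNotification {
    // Hypothetical stand-in for the data-node side: shard snapshots it has actually started.
    public static Set<String> runningShards = new HashSet<>();
    public static List<String> sentToMaster = new ArrayList<>();

    // Fixed behaviour sketched here: an ABORTED shard the node never saw in
    // INIT state is reported as failed immediately instead of being ignored,
    // so the master can finish deleting the snapshot.
    public static void onMasterSaysAborted(String shardId) {
        if (runningShards.remove(shardId)) {
            sentToMaster.add(shardId + "=FAILED(aborted while running)");
        } else {
            sentToMaster.add(shardId + "=FAILED(never started)"); // the fix: do not drop this silently
        }
    }

    public static void main(String[] args) {
        runningShards.add("shard-1");
        onMasterSaysAborted("shard-1"); // normal abort of a running shard snapshot
        onMasterSaysAborted("shard-2"); // the race: this shard was never seen in INIT
        System.out.println(sentToMaster);
    }
}
```

Without the second branch, "shard-2" would produce no notification at all, which is exactly the stuck-in-ABORTED state described above.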

original-brownbear and others added 30 commits November 22, 2018 20:30
Add NVL as an alias for IFNULL as they have the same
behaviour. Add basic tests and docs.

Closes: elastic#35782
* Forbid negative scores in function_score query

- Throw an exception when scores are negative in the field_value_factor function
- Throw an exception when scores are negative in the script_score function

Relates to elastic#33309
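The validation this commit describes can be sketched as a standalone check (a hypothetical helper, not the actual Elasticsearch implementation): any score produced by the function is rejected with an exception if it is negative.

```java
public class ScoreValidation {
    // Hypothetical helper mirroring the check described above: scores produced
    // by field_value_factor or script_score must not be negative.
    public static float checkScore(float score, String functionName) {
        if (score < 0f) {
            throw new IllegalArgumentException(
                "function [" + functionName + "] must not produce negative scores, but got [" + score + "]");
        }
        return score;
    }

    public static void main(String[] args) {
        System.out.println(checkScore(1.5f, "field_value_factor")); // valid score passes through
        try {
            checkScore(-0.1f, "script_score"); // negative score is rejected
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```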
This endpoint was not previously documented as it was not
particularly useful to end users.  However, since the HLRC
will support the endpoint we need some documentation to
link to.

The purpose of the endpoint is to provide defaults and
limits used by ML.  These are needed to fully understand
configurations that have missing values because the missing
value means the default should be used.

Relates elastic#35777
…#35828)

We didn't check that the ExplainLifecycleRequest was constructed with at least
one index before, now that we do we must also make sure the tests
mutateInstance() method used in equals/hashCode checks doesn't accidentally
create an empty index array.

Closes elastic#35822
…Action (elastic#35540)

This pull request exposes two new methods in the IndexShard and 
TransportReplicationAction classes in order to allow transport replication 
actions to acquire all index shard operation permits for their execution.

It first adds the acquireAllPrimaryOperationPermits() and the
acquireAllReplicaOperationsPermits() methods to the IndexShard class,
which allow acquiring all operation permits on a shard while exposing
a Releasable. It also refactors the TransportReplicationAction class to
expose two protected methods (acquirePrimaryOperationPermit() and
acquireReplicaOperationPermit()) that can be overridden when a transport
replication action requires the acquisition of all permits on the primary and/or
replica shard during execution.

Finally, it adds TransportReplicationAllPermitsAcquisitionTests, which
illustrates how a transport replication action can grab all permits before
adding a cluster block to the cluster state, making subsequent operations
that require a single permit fail.

Related to elastic#33888
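The permit model this commit describes can be sketched with a plain java.util.concurrent.Semaphore (an assumed simplification, not the real IndexShardOperationPermits API): ordinary operations take one permit each, while acquiring all permits only succeeds once no other operation is in flight, and returns a Releasable-like handle that gives them all back.

```java
import java.util.concurrent.Semaphore;

public class OperationPermits {
    public static final int TOTAL_PERMITS = Integer.MAX_VALUE;
    public final Semaphore permits = new Semaphore(TOTAL_PERMITS);

    // A single operation holds one permit for its duration.
    public AutoCloseable acquireOne() throws InterruptedException {
        permits.acquire(1);
        return () -> permits.release(1);
    }

    // Draining every permit blocks until all single-permit operations have
    // released theirs, so the caller runs with exclusive access to the shard.
    public AutoCloseable acquireAll() throws InterruptedException {
        permits.acquire(TOTAL_PERMITS);
        return () -> permits.release(TOTAL_PERMITS);
    }

    public static void main(String[] args) throws Exception {
        OperationPermits shard = new OperationPermits();
        try (AutoCloseable op = shard.acquireOne()) {
            System.out.println("single-permit operation running");
        }
        try (AutoCloseable all = shard.acquireAll()) {
            System.out.println("all permits held: safe to apply a cluster block");
        }
    }
}
```

The try-with-resources blocks play the role of the Releasable mentioned above: permits are always returned, even if the operation throws.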
This change fixes analyzed prefix queries in `query_string` to be ignored
if all terms are removed during the analysis.

Closes elastic#31702
Today when rolling a translog generation we copy the checkpoint from
`translog.ckp` to `translog-nnnn.ckp` using a simple `Files.copy()` followed by
appropriate `fsync()` calls. The copy operation is not atomic, so if we crash
at the wrong moment we can leave an incomplete checkpoint file on disk. In
practice the checkpoint is so small that it's either empty or fully written.
However, we do not correctly handle the case where it's empty when the node
restarts.

In contrast, in `recoverFromFiles()` we _do_ copy the checkpoint atomically.
This commit extracts the atomic copy operation from `recoverFromFiles()` and
re-uses it in `rollGeneration()`.
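The atomic-copy pattern being reused here can be sketched as follows (an assumed shape; the real Translog code differs): copy to a temp file in the target directory, fsync it, then rename it into place with ATOMIC_MOVE so the final name never refers to a partially written file.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

public class AtomicCopy {
    public static void atomicCopy(Path source, Path target) throws IOException {
        // Write the copy under a temporary name in the same directory so the
        // final rename is a same-filesystem atomic move.
        Path tmp = Files.createTempFile(target.getParent(), "pending-", ".ckp");
        try {
            Files.copy(source, tmp, StandardCopyOption.REPLACE_EXISTING);
            try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.WRITE)) {
                ch.force(true); // fsync the temp file before it becomes visible
            }
            Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
        } finally {
            Files.deleteIfExists(tmp); // no-op after a successful move
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("translog");
        Path src = dir.resolve("translog.ckp");
        Files.write(src, new byte[] {1, 2, 3});
        atomicCopy(src, dir.resolve("translog-1.ckp"));
        System.out.println(Files.size(dir.resolve("translog-1.ckp"))); // prints 3
    }
}
```

A crash before the move leaves only the temp file behind; the target name either does not exist yet or is complete, which is exactly the property the plain `Files.copy()` lacked.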
@original-brownbear
Member Author

@ywelsch #37870 was merged and I made use of it here now (added its use for all state updates instead of creating a special case; hope that's ok?).
This should be good for another review :)

@original-brownbear
Member Author

Jenkins run elasticsearch-ci/2

@ywelsch ywelsch left a comment

A few smaller comments; otherwise looking good.

}
remoteFailedRequestDeduplicator.executeOnce(
new UpdateIndexShardSnapshotStatusRequest(snapshot, shardId, status),
ActionListener.noop(),
Contributor

Perhaps add an ActionListener with trace logging to tell us whether it succeeded. I've seen this a few times in the last weeks: a noop ActionListener was used where some trace logging about the completion of the action would have been useful. Let's avoid introducing the noop() listener.
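One possible shape for such a listener (a minimal sketch using a stand-in ActionListener interface and stdout in place of real trace logging; not the actual Elasticsearch types):

```java
public class LoggingListener {
    // Stand-in for org.elasticsearch.action.ActionListener, reduced to the
    // two callbacks that matter here.
    interface ActionListener<T> {
        void onResponse(T response);
        void onFailure(Exception e);
    }

    // Instead of a noop listener, log completion either way so stuck or
    // failing status updates leave a trace in the logs.
    public static <T> ActionListener<T> tracing(String description) {
        return new ActionListener<T>() {
            @Override public void onResponse(T response) {
                System.out.println("TRACE: [" + description + "] succeeded");
            }
            @Override public void onFailure(Exception e) {
                System.out.println("TRACE: [" + description + "] failed: " + e.getMessage());
            }
        };
    }

    public static void main(String[] args) {
        ActionListener<Void> listener = tracing("update snapshot state");
        listener.onResponse(null);                              // logs success
        listener.onFailure(new RuntimeException("disconnected")); // logs failure
    }
}
```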

Member Author

Makes sense, thanks :) Removed it from the production code. (I kept it in the tests as before though; I think logging in these tests is probably just noise, and if there's an issue we can reproduce it step by step anyway.)

}
});
} catch (Exception e) {
logger.warn(() -> new ParameterizedMessage("[{}] [{}] failed to update snapshot state", snapshot, status), e);
Contributor

should we call reqListener.onFailure here? Also, when do we actually expect this to fail?

Member Author

Yes, I looked into this. This branch is never hit: the transport logic catches everything (basically just the disconnect exception) and passes it to the listener, so I moved the logging into the listener since this was just dead code imo.

@@ -200,6 +215,76 @@ public void testSuccessfulSnapshot() {
assertEquals(0, snapshotInfo.failedShards());
}

public void testSnapshotWithNodeDisconnects() {
Contributor

Should we rename this class to SnapshotResiliencyTests?

@original-brownbear
Member Author

@ywelsch thanks for taking a look, addressed the comments => should be good for another review :)

@ywelsch ywelsch left a comment

LGTM

@original-brownbear
Member Author

@ywelsch thanks!

@original-brownbear original-brownbear merged commit 0a604e3 into elastic:master Feb 1, 2019
@original-brownbear original-brownbear deleted the snapshot-interruption-its branch February 1, 2019 04:45
@ywelsch
Contributor

ywelsch commented Feb 7, 2019

Has this been backported? If so, should we remove the backport label?

@original-brownbear
Member Author

@ywelsch no, unfortunately, it is not.

I cannot backport these tests as-is and haven't yet had time to finish writing 6.x-compatible tests for this scenario.
We have two options here I guess:

  • Backport without tests
  • Wait for tests, which I was planning to write over the weekend or Monday

@ywelsch
Contributor

ywelsch commented Feb 7, 2019

Ok, let's do the tests next week and then backport. I had forgotten that this is not a straightforward backport and thought you might have forgotten to remove the label :)

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Feb 26, 2019
…37870)

* Extracted the logic for master request duplication so it can be reused by the snapshotting logic
* Removed custom listener used by `ShardStateAction` to not leak these into future users of this class
* Changed semantics slightly to get rid of redundant instantiations of the composite listener
* Relates elastic#37686
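The "execute once" deduplication extracted in that commit can be sketched as follows (a hypothetical shape, not the real Elasticsearch deduplicator class): a request identical to one already in flight is suppressed instead of being re-sent to the master.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class RequestDeduplicator {
    // Requests currently in flight, keyed by the request itself (so equal
    // requests collapse onto one entry).
    public final Map<Object, Boolean> inFlight = new ConcurrentHashMap<>();
    public int sends = 0;

    // Only the first caller for a given request actually sends it; in the
    // real code the key is removed again once the request completes, so a
    // later retry is possible.
    public void executeOnce(Object request, Runnable send) {
        if (inFlight.putIfAbsent(request, Boolean.TRUE) == null) {
            sends++;
            send.run();
        }
    }

    public static void main(String[] args) {
        RequestDeduplicator dedup = new RequestDeduplicator();
        Runnable send = () -> System.out.println("sending shard status update");
        dedup.executeOnce("shard-0=FAILED", send);
        dedup.executeOnce("shard-0=FAILED", send); // duplicate, suppressed
        System.out.println(dedup.sends); // prints 1
    }
}
```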
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Feb 26, 2019
…37870)

* Extracted the logic for master request duplication so it can be reused by the snapshotting logic
* Removed custom listener used by `ShardStateAction` to not leak these into future users of this class
* Changed semantics slightly to get rid of redundant instantiations of the composite listener
* Relates elastic#37686
original-brownbear added a commit that referenced this pull request Feb 26, 2019
…39399)

* Extracted the logic for master request duplication so it can be reused by the snapshotting logic
* Removed custom listener used by `ShardStateAction` to not leak these into future users of this class
* Changed semantics slightly to get rid of redundant instantiations of the composite listener
* Relates #37686
kovrus added a commit to crate/crate that referenced this pull request Apr 24, 2019
- Fix two race conditions that lead to stuck snapshots (elastic/elasticsearch#37686)
- Improve resilience of the SnapshotShardService (elastic/elasticsearch#36113)
- Fix concurrent snapshot ending and stabilize snapshot finalization (elastic/elasticsearch#38368)
kovrus added a commit to crate/crate that referenced this pull request Apr 25, 2019
- Fix two race conditions that lead to stuck snapshots (elastic/elasticsearch#37686)
- Improve resilience of the SnapshotShardService (elastic/elasticsearch#36113)
- Fix concurrent snapshot ending and stabilize snapshot finalization (elastic/elasticsearch#38368)
kovrus added a commit to crate/crate that referenced this pull request Apr 25, 2019
- Fix two race conditions that lead to stuck snapshots (elastic/elasticsearch#37686)
- Improve resilience of the SnapshotShardService (elastic/elasticsearch#36113)
- Fix concurrent snapshot ending and stabilize snapshot finalization (elastic/elasticsearch#38368)
kovrus added a commit to crate/crate that referenced this pull request Apr 25, 2019
- Fix two race conditions that lead to stuck snapshots (elastic/elasticsearch#37686)
- Improve resilience of the SnapshotShardService (elastic/elasticsearch#36113)
- Fix concurrent snapshot ending and stabilize snapshot finalization (elastic/elasticsearch#38368)
kovrus added a commit to crate/crate that referenced this pull request Apr 25, 2019
- Fix two race conditions that lead to stuck snapshots (elastic/elasticsearch#37686)
- Improve resilience of the SnapshotShardService (elastic/elasticsearch#36113)
- Fix concurrent snapshot ending and stabilize snapshot finalization (elastic/elasticsearch#38368)
kovrus added a commit to crate/crate that referenced this pull request Apr 26, 2019
- Fix two race conditions that lead to stuck snapshots (elastic/elasticsearch#37686)
- Improve resilience of the SnapshotShardService (elastic/elasticsearch#36113)
- Fix concurrent snapshot ending and stabilize snapshot finalization (elastic/elasticsearch#38368)
mergify bot pushed a commit to crate/crate that referenced this pull request Apr 26, 2019
- Fix two race conditions that lead to stuck snapshots (elastic/elasticsearch#37686)
- Improve resilience of the SnapshotShardService (elastic/elasticsearch#36113)
- Fix concurrent snapshot ending and stabilize snapshot finalization (elastic/elasticsearch#38368)
Labels
>bug :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs v6.7.0 v7.0.0-beta1