
Fix Two Races that Lead to Stuck Snapshots #37686

Merged

Conversation

original-brownbear
Member

@original-brownbear original-brownbear commented Jan 22, 2019

  • Fixes two broken spots:
    1. A master failover while deleting a snapshot that has no shards gets stuck if the new master finds the zero-shard snapshot in INIT state when processing the delete.
    2. Aborted shards that were never seen in INIT state by the SnapshotShardsService are never reported as failed, so the snapshot stays in ABORTED state with one or more shards stuck in ABORTED state and never gets deleted.
  • Kept the fixes as short as possible so we can backport to 6.x with the least amount of risk.
  • Significantly extended the test infrastructure to reproduce the two issues above.
    • Two new test runs:
      1. Reproducing the effects of node disconnects/restarts in isolation
      2. Reproducing the effects of disconnects/restarts in parallel with shard relocations and deletes
  • Relates to #32265 (Update IndexShardSnapshotStatus when an exception is encountered)
  • Closes #32348 (Snapshot stuck in half-deleted state)
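The second race can be pictured with a minimal, self-contained sketch (plain Java with hypothetical names; this is not the actual SnapshotShardsService code): when the master marks a shard ABORTED, the data node must report a failure even for shards it never saw in INIT, otherwise the master waits forever for a notification that will never come.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class AbortedShardNotification {
    // Hypothetical stand-in for the data-node side: shard snapshots it has actually started.
    public static Set<String> runningShards = new HashSet<>();
    public static List<String> sentToMaster = new ArrayList<>();

    // Fixed behaviour sketched here: an ABORTED shard the node never saw in
    // INIT state is reported as failed immediately instead of being ignored,
    // so the master can finish deleting the snapshot.
    public static void onMasterSaysAborted(String shardId) {
        if (runningShards.remove(shardId)) {
            sentToMaster.add(shardId + "=FAILED(aborted while running)");
        } else {
            sentToMaster.add(shardId + "=FAILED(never started)"); // the fix: do not drop this silently
        }
    }

    public static void main(String[] args) {
        runningShards.add("shard-1");
        onMasterSaysAborted("shard-1"); // normal abort of a running shard snapshot
        onMasterSaysAborted("shard-2"); // the race: this shard was never seen in INIT
        System.out.println(sentToMaster);
    }
}
```

Without the second branch, "shard-2" would produce no notification at all, which is exactly the stuck-in-ABORTED state described above.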

original-brownbear and others added 30 commits November 22, 2018 20:30
Add NVL as an alias for IFNULL as they have the same
behaviour. Add basic tests and docs.

Closes: elastic#35782
* Forbid negative scores in function_score query

- Throw an exception when scores are negative in the field_value_factor function
- Throw an exception when scores are negative in the script_score function

Relates to elastic#33309
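The validation this commit describes can be sketched as a standalone check (a hypothetical helper, not the actual Elasticsearch implementation): any score produced by the function is rejected with an exception if it is negative.

```java
public class ScoreValidation {
    // Hypothetical helper mirroring the check described above: scores produced
    // by field_value_factor or script_score must not be negative.
    public static float checkScore(float score, String functionName) {
        if (score < 0f) {
            throw new IllegalArgumentException(
                "function [" + functionName + "] must not produce negative scores, but got [" + score + "]");
        }
        return score;
    }

    public static void main(String[] args) {
        System.out.println(checkScore(1.5f, "field_value_factor")); // valid score passes through
        try {
            checkScore(-0.1f, "script_score"); // negative score is rejected
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```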
This endpoint was not previously documented as it was not
particularly useful to end users.  However, since the HLRC
will support the endpoint we need some documentation to
link to.

The purpose of the endpoint is to provide defaults and
limits used by ML.  These are needed to fully understand
configurations that have missing values because the missing
value means the default should be used.

Relates elastic#35777
…#35828)

We didn't check that the ExplainLifecycleRequest was constructed with at least
one index before, now that we do we must also make sure the tests
mutateInstance() method used in equals/hashCode checks doesn't accidentally
create an empty index array.

Closes elastic#35822
…Action (elastic#35540)

This pull request exposes two new methods in the IndexShard and 
TransportReplicationAction classes in order to allow transport replication 
actions to acquire all index shard operation permits for their execution.

It first adds the acquireAllPrimaryOperationPermits() and the
acquireAllReplicaOperationsPermits() methods to the IndexShard class,
which allow acquiring all operation permits on a shard while exposing
a Releasable. It also refactors the TransportReplicationAction class to
expose two protected methods (acquirePrimaryOperationPermit() and
acquireReplicaOperationPermit()) that can be overridden when a transport
replication action requires the acquisition of all permits on the primary and/or
replica shard during execution.

Finally, it adds TransportReplicationAllPermitsAcquisitionTests, which
illustrates how a transport replication action can grab all permits before
adding a cluster block to the cluster state, making subsequent operations
that require a single permit fail.

Related to elastic#33888
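The permit model this commit describes can be sketched with a plain java.util.concurrent.Semaphore (an assumed simplification, not the real IndexShardOperationPermits API): ordinary operations take one permit each, while acquiring all permits only succeeds once no other operation is in flight, and returns a Releasable-like handle that gives them all back.

```java
import java.util.concurrent.Semaphore;

public class OperationPermits {
    public static final int TOTAL_PERMITS = Integer.MAX_VALUE;
    public final Semaphore permits = new Semaphore(TOTAL_PERMITS);

    // A single operation holds one permit for its duration.
    public AutoCloseable acquireOne() throws InterruptedException {
        permits.acquire(1);
        return () -> permits.release(1);
    }

    // Draining every permit blocks until all single-permit operations have
    // released theirs, so the caller runs with exclusive access to the shard.
    public AutoCloseable acquireAll() throws InterruptedException {
        permits.acquire(TOTAL_PERMITS);
        return () -> permits.release(TOTAL_PERMITS);
    }

    public static void main(String[] args) throws Exception {
        OperationPermits shard = new OperationPermits();
        try (AutoCloseable op = shard.acquireOne()) {
            System.out.println("single-permit operation running");
        }
        try (AutoCloseable all = shard.acquireAll()) {
            System.out.println("all permits held: safe to apply a cluster block");
        }
    }
}
```

The try-with-resources blocks play the role of the Releasable mentioned above: permits are always returned, even if the operation throws.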
This change fixes analyzed prefix queries in `query_string` to be ignored
if all terms are removed during the analysis.

Closes elastic#31702
Today when rolling a translog generation we copy the checkpoint from
`translog.ckp` to `translog-nnnn.ckp` using a simple `Files.copy()` followed by
appropriate `fsync()` calls. The copy operation is not atomic, so if we crash
at the wrong moment we can leave an incomplete checkpoint file on disk. In
practice the checkpoint is so small that it's either empty or fully written.
However, we do not correctly handle the case where it's empty when the node
restarts.

In contrast, in `recoverFromFiles()` we _do_ copy the checkpoint atomically.
This commit extracts the atomic copy operation from `recoverFromFiles()` and
re-uses it in `rollGeneration()`.
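The atomic-copy pattern being reused here can be sketched as follows (an assumed shape; the real Translog code differs): copy to a temp file in the target directory, fsync it, then rename it into place with ATOMIC_MOVE so the final name never refers to a partially written file.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

public class AtomicCopy {
    public static void atomicCopy(Path source, Path target) throws IOException {
        // Write the copy under a temporary name in the same directory so the
        // final rename is a same-filesystem atomic move.
        Path tmp = Files.createTempFile(target.getParent(), "pending-", ".ckp");
        try {
            Files.copy(source, tmp, StandardCopyOption.REPLACE_EXISTING);
            try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.WRITE)) {
                ch.force(true); // fsync the temp file before it becomes visible
            }
            Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
        } finally {
            Files.deleteIfExists(tmp); // no-op after a successful move
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("translog");
        Path src = dir.resolve("translog.ckp");
        Files.write(src, new byte[] {1, 2, 3});
        atomicCopy(src, dir.resolve("translog-1.ckp"));
        System.out.println(Files.size(dir.resolve("translog-1.ckp"))); // prints 3
    }
}
```

A crash before the move leaves only the temp file behind; the target name either does not exist yet or is complete, which is exactly the property the plain `Files.copy()` lacked.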
@original-brownbear
Member Author

@ywelsch #37870 was merged and I made use of it here now (added its use for all state updates instead of creating a special case; hope that's ok?).
This should be good for another review :)

@original-brownbear
Member Author

Jenkins run elasticsearch-ci/2

@ywelsch ywelsch left a comment

A few smaller comments; otherwise looking good.

}
remoteFailedRequestDeduplicator.executeOnce(
new UpdateIndexShardSnapshotStatusRequest(snapshot, shardId, status),
ActionListener.noop(),
Contributor

Perhaps add an ActionListener with trace logging to tell us whether it succeeded. I've seen this a few times in the last weeks: a noop ActionListener was used where some trace logging about the completion of the action would have been useful. Let's avoid introducing the noop() listener.
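One possible shape for such a listener (a minimal sketch using a stand-in ActionListener interface and stdout in place of real trace logging; not the actual Elasticsearch types):

```java
public class LoggingListener {
    // Stand-in for org.elasticsearch.action.ActionListener, reduced to the
    // two callbacks that matter here.
    interface ActionListener<T> {
        void onResponse(T response);
        void onFailure(Exception e);
    }

    // Instead of a noop listener, log completion either way so stuck or
    // failing status updates leave a trace in the logs.
    public static <T> ActionListener<T> tracing(String description) {
        return new ActionListener<T>() {
            @Override public void onResponse(T response) {
                System.out.println("TRACE: [" + description + "] succeeded");
            }
            @Override public void onFailure(Exception e) {
                System.out.println("TRACE: [" + description + "] failed: " + e.getMessage());
            }
        };
    }

    public static void main(String[] args) {
        ActionListener<Void> listener = tracing("update snapshot state");
        listener.onResponse(null);                              // logs success
        listener.onFailure(new RuntimeException("disconnected")); // logs failure
    }
}
```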

Member Author

Makes sense, thanks :) Removed it from the production code. (I kept it in the tests as before though; I think logging in these tests is probably just noise, and if there's an issue we can reproduce it step by step anyway.)

}
});
} catch (Exception e) {
logger.warn(() -> new ParameterizedMessage("[{}] [{}] failed to update snapshot state", snapshot, status), e);
Contributor

should we call reqListener.onFailure here? Also, when do we actually expect this to fail?

Member Author

Yes, I looked into this. This branch is never hit: the transport logic catches everything (basically just the disconnect exception) and passes it to the listener, so I moved the logging into the listener since this was just dead code imo.

@@ -200,6 +215,76 @@ public void testSuccessfulSnapshot() {
assertEquals(0, snapshotInfo.failedShards());
}

public void testSnapshotWithNodeDisconnects() {
Contributor

Should we rename this class to SnapshotResiliencyTests?

@original-brownbear
Member Author

@ywelsch thanks for taking a look, addressed the comments => should be good for another review :)

@ywelsch ywelsch left a comment

LGTM

@original-brownbear
Member Author

@ywelsch thanks!

@original-brownbear original-brownbear merged commit 0a604e3 into elastic:master Feb 1, 2019
@original-brownbear original-brownbear deleted the snapshot-interruption-its branch February 1, 2019 04:45
@ywelsch
Contributor

ywelsch commented Feb 7, 2019

Has this been backported? If so, should we remove the backport label?

@original-brownbear
Member Author

@ywelsch no, unfortunately, it is not.

I cannot backport these tests as-is and haven't yet had time to finish writing 6.x-compatible tests for this scenario.
We have two options here I guess:

  • Backport without tests
  • Wait for tests, which I was planning to write over the weekend or Monday

@ywelsch
Contributor

ywelsch commented Feb 7, 2019

Ok, let's do the tests next week and then backport. I had forgotten that this is not a straightforward backport and thought you might have forgotten to remove the label :)

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Feb 26, 2019
…37870)

* Extracted the logic for master request duplication so it can be reused by the snapshotting logic
* Removed custom listener used by `ShardStateAction` to not leak these into future users of this class
* Changed semantics slightly to get rid of redundant instantiations of the composite listener
* Relates elastic#37686
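The "execute once" deduplication extracted in that commit can be sketched as follows (a hypothetical shape, not the real Elasticsearch deduplicator class): a request identical to one already in flight is suppressed instead of being re-sent to the master.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class RequestDeduplicator {
    // Requests currently in flight, keyed by the request itself (so equal
    // requests collapse onto one entry).
    public final Map<Object, Boolean> inFlight = new ConcurrentHashMap<>();
    public int sends = 0;

    // Only the first caller for a given request actually sends it; in the
    // real code the key is removed again once the request completes, so a
    // later retry is possible.
    public void executeOnce(Object request, Runnable send) {
        if (inFlight.putIfAbsent(request, Boolean.TRUE) == null) {
            sends++;
            send.run();
        }
    }

    public static void main(String[] args) {
        RequestDeduplicator dedup = new RequestDeduplicator();
        Runnable send = () -> System.out.println("sending shard status update");
        dedup.executeOnce("shard-0=FAILED", send);
        dedup.executeOnce("shard-0=FAILED", send); // duplicate, suppressed
        System.out.println(dedup.sends); // prints 1
    }
}
```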
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Feb 26, 2019
…37870)

* Extracted the logic for master request duplication so it can be reused by the snapshotting logic
* Removed custom listener used by `ShardStateAction` to not leak these into future users of this class
* Changed semantics slightly to get rid of redundant instantiations of the composite listener
* Relates elastic#37686
original-brownbear added a commit that referenced this pull request Feb 26, 2019
…39399)

* Extracted the logic for master request duplication so it can be reused by the snapshotting logic
* Removed custom listener used by `ShardStateAction` to not leak these into future users of this class
* Changed semantics slightly to get rid of redundant instantiations of the composite listener
* Relates #37686
kovrus added a commit to crate/crate that referenced this pull request Apr 24, 2019
- Fix two race conditions that lead to stuck snapshots (elastic/elasticsearch#37686)
- Improve resilience of the SnapshotShardService (elastic/elasticsearch#36113)
- Fix concurrent snapshot ending and stabilize snapshot finalization (elastic/elasticsearch#38368)
kovrus added a commit to crate/crate that referenced this pull request Apr 25, 2019
- Fix two race conditions that lead to stuck snapshots (elastic/elasticsearch#37686)
- Improve resilience of the SnapshotShardService (elastic/elasticsearch#36113)
- Fix concurrent snapshot ending and stabilize snapshot finalization (elastic/elasticsearch#38368)
kovrus added a commit to crate/crate that referenced this pull request Apr 25, 2019
- Fix two race conditions that lead to stuck snapshots (elastic/elasticsearch#37686)
- Improve resilience of the SnapshotShardService (elastic/elasticsearch#36113)
- Fix concurrent snapshot ending and stabilize snapshot finalization (elastic/elasticsearch#38368)
kovrus added a commit to crate/crate that referenced this pull request Apr 25, 2019
- Fix two race conditions that lead to stuck snapshots (elastic/elasticsearch#37686)
- Improve resilience of the SnapshotShardService (elastic/elasticsearch#36113)
- Fix concurrent snapshot ending and stabilize snapshot finalization (elastic/elasticsearch#38368)
kovrus added a commit to crate/crate that referenced this pull request Apr 25, 2019
- Fix two race conditions that lead to stuck snapshots (elastic/elasticsearch#37686)
- Improve resilience of the SnapshotShardService (elastic/elasticsearch#36113)
- Fix concurrent snapshot ending and stabilize snapshot finalization (elastic/elasticsearch#38368)
kovrus added a commit to crate/crate that referenced this pull request Apr 26, 2019
- Fix two race conditions that lead to stuck snapshots (elastic/elasticsearch#37686)
- Improve resilience of the SnapshotShardService (elastic/elasticsearch#36113)
- Fix concurrent snapshot ending and stabilize snapshot finalization (elastic/elasticsearch#38368)
mergify bot pushed a commit to crate/crate that referenced this pull request Apr 26, 2019
- Fix two race conditions that lead to stuck snapshots (elastic/elasticsearch#37686)
- Improve resilience of the SnapshotShardService (elastic/elasticsearch#36113)
- Fix concurrent snapshot ending and stabilize snapshot finalization (elastic/elasticsearch#38368)
Labels
>bug :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs v6.7.0 v7.0.0-beta1