Fix Two Races that Lead to Stuck Snapshots #37686
Add NVL as alias to IFNULL as they have the same behaviour. Add basic tests and docs. Closes: elastic#35782
* Forbid negative scores in function_score query
  - Throw an exception when scores are negative in field_value_factor function
  - Throw an exception when scores are negative in script_score function
  Relates to elastic#33309
… TransportReplicationAction (elastic#35332)"" This reverts commit d3d7c01
Relates elastic#35822
Relates to elastic#35496
This endpoint was not previously documented as it was not particularly useful to end users. However, since the HLRC will support the endpoint we need some documentation to link to. The purpose of the endpoint is to provide defaults and limits used by ML. These are needed to fully understand configurations that have missing values because the missing value means the default should be used. Relates elastic#35777
…#35828) Previously we didn't check that the ExplainLifecycleRequest was constructed with at least one index. Now that we do, we must also make sure the test's mutateInstance() method, used in equals/hashCode checks, doesn't accidentally create an empty index array. Closes elastic#35822
…Action (elastic#35540) This pull request exposes two new methods in the IndexShard and TransportReplicationAction classes to allow transport replication actions to acquire all index shard operation permits for their execution. It first adds the acquireAllPrimaryOperationPermits() and the acquireAllReplicaOperationsPermits() methods to the IndexShard class, which acquire all operation permits on a shard while exposing a Releasable. It also refactors the TransportReplicationAction class to expose two protected methods (acquirePrimaryOperationPermit() and acquireReplicaOperationPermit()) that can be overridden when a transport replication action requires the acquisition of all permits on the primary and/or replica shard during execution. Finally, it adds TransportReplicationAllPermitsAcquisitionTests, which illustrates how a transport replication action can grab all permits before adding a cluster block in the cluster state, making subsequent operations that require a single permit fail. Related to elastic#33888
This change fixes analyzed prefix queries in `query_string` to be ignored if all terms are removed during the analysis. Closes elastic#31702
Today when rolling a translog generation we copy the checkpoint from `translog.ckp` to `translog-nnnn.ckp` using a simple `Files.copy()` followed by appropriate `fsync()` calls. The copy operation is not atomic, so if we crash at the wrong moment we can leave an incomplete checkpoint file on disk. In practice the checkpoint is so small that it's either empty or fully written. However, we do not correctly handle the case where it's empty when the node restarts. In contrast, in `recoverFromFiles()` we _do_ copy the checkpoint atomically. This commit extracts the atomic copy operation from `recoverFromFiles()` and re-uses it in `rollGeneration()`.
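A minimal sketch of the atomic-copy idea described in this commit message, using only `java.nio.file`: copy to a temporary file in the target directory, fsync it, then rename it into place with `ATOMIC_MOVE`. The names `AtomicCopySketch` and `copyAtomically` and the temp-file naming are hypothetical and not taken from the Elasticsearch code; fsync of the parent directory is omitted for brevity.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, for illustration only.
final class AtomicCopySketch {
    private AtomicCopySketch() {}

    // Readers of `target` see either no file or a fully written file, never a partial one.
    static void copyAtomically(Path source, Path target) throws IOException {
        // Create the temp file next to the target so the final rename stays on one filesystem.
        Path temp = Files.createTempFile(target.toAbsolutePath().getParent(), "ckp", ".tmp");
        try {
            Files.copy(source, temp, StandardCopyOption.REPLACE_EXISTING);
            // Flush the copied bytes to disk before the file becomes visible under its final name.
            try (FileChannel channel = FileChannel.open(temp, StandardOpenOption.WRITE)) {
                channel.force(true);
            }
            Files.move(temp, target, StandardCopyOption.ATOMIC_MOVE);
        } finally {
            // No-op after a successful move; cleans up the temp file if anything failed.
            Files.deleteIfExists(temp);
        }
    }
}
```

A crash before the final `Files.move` leaves only a stray temp file behind, so an empty or half-written `translog-nnnn.ckp` cannot be observed on restart.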
Jenkins run elasticsearch-ci/2
A few smaller comments, looking good o.w.
    }
    remoteFailedRequestDeduplicator.executeOnce(
        new UpdateIndexShardSnapshotStatusRequest(snapshot, shardId, status),
        ActionListener.noop(),
Perhaps add an ActionListener with trace logging to tell us whether it succeeded or not. I've seen this a few times in the last weeks where a NOOP ActionListener was used, but most times some trace logging telling us about completion of the action would have been useful. Let's avoid introducing the noop() listener.
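A rough sketch of what such a listener could look like. To stay self-contained it uses a hypothetical stand-in interface `Listener<T>` (not Elasticsearch's `ActionListener`) and `java.util.logging`, where `FINEST` plays the role of trace level; the class name `TraceLoggingListener` and the message wording are assumptions.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical stand-in for a two-callback listener; not the Elasticsearch interface.
interface Listener<T> {
    void onResponse(T response);
    void onFailure(Exception e);
}

// Records the outcome of an action instead of silently dropping it.
final class TraceLoggingListener<T> implements Listener<T> {
    private final Logger logger;
    private final String description;

    TraceLoggingListener(Logger logger, String description) {
        this.logger = logger;
        this.description = description;
    }

    @Override
    public void onResponse(T response) {
        // Success is only interesting when debugging, so log it at the finest level.
        logger.log(Level.FINEST, "completed: " + description);
    }

    @Override
    public void onFailure(Exception e) {
        // Failures are logged at warning level together with the exception.
        logger.log(Level.WARNING, "failed: " + description, e);
    }
}
```

Compared with a no-op listener, this costs almost nothing when trace logging is disabled and leaves a record of completion when it is enabled.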
Makes sense thanks :) Removed it from production code. (kept it in the tests though like it was before, I think logging in these tests is probably just noise and if there's an issue we can reproduce it step by step anyway?)
            }
        });
    } catch (Exception e) {
        logger.warn(() -> new ParameterizedMessage("[{}] [{}] failed to update snapshot state", snapshot, status), e);
Should we call reqListener.onFailure here? Also, when do we actually expect this to fail?
Yea, looked into this. This is never hit, the transport logic catches everything (basically just the disconnected exception) and passes it to the listener => moved the logging to the listener since this is just dead code imo.
@@ -200,6 +215,76 @@ public void testSuccessfulSnapshot() {
        assertEquals(0, snapshotInfo.failedShards());
    }

    public void testSnapshotWithNodeDisconnects() {
Should we rename this class to SnapshotResiliencyTests?
@ywelsch thanks for taking a look, addressed the comments => should be good for another review :)
LGTM
@ywelsch thanks!
Is this backported? If so, remove backport label?
@ywelsch no, unfortunately, it is not. I cannot backport these tests and I didn't yet have time to finish writing 6.x compatible tests for this scenario.
OK, let's do the tests next week and then backport. I had forgotten that this is not a straightforward backport and thought you might have forgotten to remove the label :)
…37870)
* Extracted the logic for master request deduplication so it can be reused by the snapshotting logic
* Removed custom listener used by `ShardStateAction` to not leak these into future users of this class
* Changed semantics slightly to get rid of redundant instantiations of the composite listener
* Relates elastic#37686
…39399)
* Extracted the logic for master request deduplication so it can be reused by the snapshotting logic
* Removed custom listener used by `ShardStateAction` to not leak these into future users of this class
* Changed semantics slightly to get rid of redundant instantiations of the composite listener
* Relates #37686
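The deduplication idea mentioned in these commits can be sketched roughly as follows: identical in-flight requests share a single send, and every caller's callback is completed when that one send finishes. This is a simplified illustration under stated assumptions, not the extracted Elasticsearch class; `RequestDeduplicatorSketch`, its `executeOnce` signature, and the convention that a `null` exception means success are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.BiConsumer;
import java.util.function.Consumer;

// Hypothetical sketch; callbacks receive null on success or the failure otherwise.
final class RequestDeduplicatorSketch<T> {
    // Maps each in-flight request to the callbacks waiting on its completion.
    private final ConcurrentMap<T, List<Consumer<Exception>>> inFlight = new ConcurrentHashMap<>();

    void executeOnce(T request, Consumer<Exception> onDone, BiConsumer<T, Consumer<Exception>> sender) {
        boolean[] first = {false};
        // compute() is atomic per key, so exactly one caller observes the missing entry.
        inFlight.compute(request, (key, waiters) -> {
            if (waiters == null) {
                waiters = new ArrayList<>();
                first[0] = true;
            }
            waiters.add(onDone);
            return waiters;
        });
        if (first[0]) {
            // Only the first caller actually sends; completion fans out to all waiters.
            sender.accept(request, failure -> {
                List<Consumer<Exception>> waiters = inFlight.remove(request);
                if (waiters != null) {
                    for (Consumer<Exception> waiter : waiters) {
                        waiter.accept(failure);
                    }
                }
            });
        }
    }
}
```

Once the shared send completes, the entry is removed, so a later call with an equal request triggers a fresh send. Requests used as keys need consistent `equals` and `hashCode`.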
- Fix two race conditions that lead to stuck snapshots (elastic/elasticsearch#37686)
- Improve resilience of SnapshotShardService (elastic/elasticsearch#36113)
- Fix concurrent snapshot ending and stabilize snapshot finalization (elastic/elasticsearch#38368)
`INIT`
when deleting `INIT`
state by the `SnapshotsShardService`
will not be notified as failed, leading to the snapshot staying in `ABORTED`
state and never getting deleted with one or more shards stuck in `ABORTED`
state `6.x`
with the least amount of risk