Don’t ack if unable to remove failing replica #39584
Conversation
Pinging @elastic/es-distributed
@@ -436,4 +442,54 @@ public void testIndicesDeleted() throws Exception {
        assertFalse(client().admin().indices().prepareExists(idxName).get().isExists());
    }

    public void testRestartPrimaryNodeWhileIndexing() throws Exception {
I will combine this newly added test with testAckedIndexing in a follow-up.
was this test able to expose the bug without the fix?
I ran 5000 iterations and this failed twice. This test is merely copied from testFollowIndexAndCloseNode in x-pack/plugin/ccr/src/test/java/org/elasticsearch/xpack/ccr/FollowerFailOverIT.java (line 128 in eea65da).
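
In outline, a test of this shape keeps an indexing thread running while a node restarts, then verifies that every acknowledged write survived. The sketch below is illustrative only, written for an ESIntegTestCase test method: the index name "test", the "_doc" type, and the use of restartRandomDataNode are assumptions, not the committed test body.

// Illustrative sketch of the restart-while-indexing pattern; not the committed test.
final AtomicBoolean stopped = new AtomicBoolean();
final Set<String> ackedIds = ConcurrentCollections.newConcurrentSet();
Thread indexer = new Thread(() -> {
    while (stopped.get() == false) {
        try {
            IndexResponse response = client().prepareIndex("test", "_doc")
                .setSource("{}", XContentType.JSON).get();
            ackedIds.add(response.getId()); // record only acknowledged writes
        } catch (Exception e) {
            // writes that were never acked may fail during the restart; that is fine
        }
    }
});
indexer.start();
internalCluster().restartRandomDataNode();
stopped.set(true);
indexer.join();
refresh("test");
for (String id : ackedIds) {
    // an acked write must never be lost, even across a node restart
    assertTrue(client().prepareGet("test", "_doc", id).get().isExists());
}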
primary.failShard(message, failure);
} else {
    // these can occur if the node is shutting down and are okay; any other exception here is not expected and merits investigation.
    assert failure instanceof NodeClosedException || failure instanceof TransportException : failure;
@bleskes mentioned that the TransportException here should be coming from server/src/main/java/org/elasticsearch/transport/TransportService.java, lines 645 to 649 in 0227260:

if (lifecycle.stoppedOrClosed()) {
    // if we are not started the exception handling will remove the RequestHolder again and calls the handler to notify
    // the caller. It will only notify if the toStop code hasn't done the work yet.
    throw new TransportException("TransportService is closed stopped can't send request");
}
We should verify this here with an assertion and (I think in a follow-up) look into throwing a more appropriate exception in TransportService. Possible options are AlreadyClosedException or a custom subclass of TransportException.
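
For example, until the follow-up lands, the assertion could pin down the exact exception using the message string from the snippet quoted above. This is a sketch of the suggestion, not the pushed commit:

// Sketch of the suggested verification; the message check would become
// unnecessary once TransportService throws a dedicated exception type.
assert failure instanceof NodeClosedException
    || (failure instanceof TransportException
        && "TransportService is closed stopped can't send request".equals(failure.getMessage()))
    : failure;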
I pushed 6afb094.
for (ShardRouting shardRouting : clusterState.routingTable().allShards(index)) {
    if (shardRouting.primary()) {
        String nodeName = clusterState.nodes().get(shardRouting.currentNodeId()).getName();
        internalCluster().restartNode(nodeName, new InternalTestCluster.RestartCallback());
    }
}
perhaps just restart a random node instead of explicitly the one with the primary?
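
The test framework has a helper for exactly that, so the loop above could collapse to a single call. A sketch of the suggestion; whether the pushed commit did precisely this is not shown here:

// restart any data node instead of singling out the one holding the primary
internalCluster().restartRandomDataNode(new InternalTestCluster.RestartCallback());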
Yep, I pushed ab65f4b
@ywelsch Thanks for looking. It's ready again.
LGTM.
@ywelsch Thanks!
Thanks for letting me know!
We need to unwrap and use the actual cause when determining if the node with the primary shard is shutting down, because TransportService will throw a TransportException wrapped in a SendRequestTransportException. Relates #39584
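
Concretely, the check has to look through the wrapper. Since SendRequestTransportException is an ElasticsearchWrapperException, ExceptionsHelper.unwrapCause returns the underlying cause; a minimal sketch of the idea, not the committed diff:

// Sketch: inspect the unwrapped cause, not the SendRequestTransportException wrapper.
final Throwable cause = ExceptionsHelper.unwrapCause(failure);
assert cause instanceof NodeClosedException || cause instanceof TransportException : failure;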
Today when a replicated write operation fails to execute on a replica, the primary will reach out to the master to fail that replica (and mark it stale). We then won't ack that request until the master removes the failing replica; otherwise, we will lose the acked operation if the failed replica is still in the in-sync set. However, if a node with the primary is shutting down, we might ack such a request even though we are unable to send a shard-failure request to the master. This happens because we ignore NodeClosedException, which is triggered when the ClusterService is being closed. Closes #39467
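
In outline, the rule the fix enforces looks like this. The method and variable names below are hypothetical, not the actual ReplicationOperation code:

// Illustrative sketch of the acking rule; names are hypothetical.
void onFailShardRequestFailed(Exception e) {
    final Throwable cause = ExceptionsHelper.unwrapCause(e);
    // a closing node is the only benign reason for not reaching the master
    assert cause instanceof NodeClosedException || cause instanceof TransportException : e;
    // never ack here: the failed replica may still be in the in-sync set,
    // and acking now could silently lose the acknowledged write
    finishAsFailed(e); // hypothetical helper that fails the whole operation
}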
* 6.7:
  Fix CCR HLRC docs
  Introduce forget follower API (elastic#39718)
  6.6.2 release notes.
  Update release notes for 6.7.0
  Add documentation for min_hash filter (elastic#39671)
  Unmute testIndividualActionsTimeout
  Unmute testFollowIndexAndCloseNode
  Use unwrapped cause to determine if node is closing (elastic#39723)
  Don’t ack if unable to remove failing replica (elastic#39584)
  Wipe Snapshots Before Indices in RestTests (elastic#39662) (elastic#39765)
  Bug fix for AnnotatedTextHighlighter (elastic#39525)
  Fix Snapshot BwC with Version 5.6.x (elastic#39737)
  Fix occasional SearchServiceTests failure (elastic#39697)
  Correct date in daterange-aggregation.asciidoc (elastic#39727)
  Add a note to purge the ingest-geoip plugin (elastic#39553)
If TransportService is stopped before a shard-failure request is sent but after the request is registered, TransportService will notify ReplicationOperation of a TransportException with the error message "transport stop, action: internal:cluster/shard/failure". Relates #39584
/cc @bleskes @martijnvg @jasontedor