Don’t ack if unable to remove failing replica #39584

Merged
merged 6 commits into elastic:master on Mar 5, 2019

Conversation


@dnhatn dnhatn commented Mar 1, 2019

Today when a replicated write operation fails to execute on a replica, the primary reaches out to the master to fail that replica (and mark it stale). We then won't ack the request until the master has removed the failing replica; otherwise, we would lose the acked operation if the failed replica is still in the in-sync set. However, if the node holding the primary is shutting down, we might ack such a request even though we were unable to send a shard-failure request to the master. This happens because we ignore the NodeClosedException that is thrown when the ClusterService is being closed.

Closes #39467

/cc @bleskes @martijnvg @jasontedor
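
For illustration, here is a minimal, self-contained sketch of the acking decision described above. The class, method, and variable names are hypothetical, and the two exception types are stubbed locally so the sketch compiles on its own; it is not the actual Elasticsearch implementation.

public final class ReplicaFailureAckSketch {

    // local stand-ins for the real exception types, so the sketch is self-contained
    static class NodeClosedException extends RuntimeException {}
    static class TransportException extends RuntimeException {}

    /**
     * Decides whether a replicated write may be acked after a replica failed.
     * shardFailureError is null if the master acknowledged removing the replica,
     * otherwise it is the error hit while trying to fail the shard.
     */
    static boolean safeToAck(Exception shardFailureError) {
        if (shardFailureError == null) {
            // the master removed the failing replica (marked it stale); acking is safe
            return true;
        }
        if (shardFailureError instanceof NodeClosedException
                || shardFailureError instanceof TransportException) {
            // the node holding the primary is shutting down, so the master may never
            // learn about the stale replica; do not ack, or the acked write could be lost
            return false;
        }
        // anything else is unexpected and merits investigation
        throw new AssertionError(shardFailureError);
    }

    public static void main(String[] args) {
        System.out.println(safeToAck(null));                      // true
        System.out.println(safeToAck(new NodeClosedException())); // false
    }
}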

@dnhatn dnhatn added the >bug, blocker, v7.0.0, :Distributed Indexing/Distributed, v6.7.0, v8.0.0, v7.2.0, v6.6.2, and v5.6.16 labels Mar 1, 2019
@dnhatn dnhatn requested a review from ywelsch March 1, 2019 20:05
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@@ -436,4 +442,54 @@ public void testIndicesDeleted() throws Exception {
assertFalse(client().admin().indices().prepareExists(idxName).get().isExists());
}

public void testRestartPrimaryNodeWhileIndexing() throws Exception {
Member Author

I will merge this newly added test into testAckedIndexing in a follow-up.

Contributor

was this test able to expose the bug without the fix?

Member Author

I ran 5,000 iterations without the fix and this test failed twice. The test is essentially copied from

public void testFollowIndexAndCloseNode() throws Exception {

primary.failShard(message, failure);
} else {
// these can occur if the node is shutting down and are okay
// any other exception here is not expected and merits investigation
assert failure instanceof NodeClosedException || failure instanceof TransportException : failure;
Contributor

@bleskes mentioned that the TransportException here should be coming from

if (lifecycle.stoppedOrClosed()) {
// if we are not started the exception handling will remove the RequestHolder again and calls the handler to notify
// the caller. It will only notify if the toStop code hasn't done the work yet.
throw new TransportException("TransportService is closed stopped can't send request");
}

We should verify this here with an assertion and (I think in a follow-up) look into throwing a more appropriate exception in TransportService. Possible options are AlreadyClosedException or a custom subclass of TransportException.
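
As an aside, a minimal sketch of the kind of assertion being suggested, assuming the check simply matches the message quoted from TransportService; the class and helper names are hypothetical and the exception types are local stubs, not the actual change (which landed in the commit mentioned in the reply below).

public final class ShardFailureAssertionSketch {

    // local stubs so the sketch compiles on its own
    static class NodeClosedException extends RuntimeException {}
    static class TransportException extends RuntimeException {
        TransportException(String message) { super(message); }
    }

    // accept the failure only if the node is closing or the transport service
    // reported the specific shutdown message quoted above
    static void assertExpectedFailure(Exception failure) {
        assert failure instanceof NodeClosedException
                || (failure instanceof TransportException
                        && "TransportService is closed stopped can't send request".equals(failure.getMessage()))
                : failure;
    }

    public static void main(String[] args) {
        // run with -ea to enable assertions; both calls pass
        assertExpectedFailure(new NodeClosedException());
        assertExpectedFailure(new TransportException("TransportService is closed stopped can't send request"));
    }
}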

Member Author

I pushed 6afb094.


for (ShardRouting shardRouting : clusterState.routingTable().allShards(index)) {
if (shardRouting.primary()) {
String nodeName = clusterState.nodes().get(shardRouting.currentNodeId()).getName();
internalCluster().restartNode(nodeName, new InternalTestCluster.RestartCallback());
Contributor

perhaps just restart a random node instead of explicitly the one with the primary?

Member Author

Yep, I pushed ab65f4b


dnhatn commented Mar 4, 2019

@ywelsch Thanks for looking. It's ready again.

@dnhatn dnhatn requested a review from ywelsch March 4, 2019 17:40
@dnhatn dnhatn added the v6.6.2 label Mar 4, 2019

@ywelsch ywelsch left a comment


LGTM.

@ywelsch ywelsch removed the v6.6.2 label Mar 5, 2019

dnhatn commented Mar 5, 2019

@ywelsch Thanks!

@dnhatn dnhatn merged commit ecf6af4 into elastic:master Mar 5, 2019
@dnhatn dnhatn deleted the primary-shutdown branch March 5, 2019 16:22

jpountz commented Mar 6, 2019

Thanks for letting me know!

dnhatn added a commit that referenced this pull request Mar 6, 2019
We need to unwrap and use the actual cause when determining whether the node
with the primary shard is shutting down, because TransportService throws a
TransportException wrapped in a SendRequestTransportException.

Relates #39584
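
To illustrate why the unwrapping matters, here is a small self-contained sketch: the sender surfaces a wrapper exception (SendRequestTransportException in the real code) whose cause is the exception we actually want to classify. All types below are local stubs and the helper is hypothetical; the real code relies on Elasticsearch's own exception handling.

public final class UnwrapCauseSketch {

    // local stubs standing in for the real Elasticsearch exception types
    static class NodeClosedException extends RuntimeException {}
    static class TransportException extends RuntimeException {
        TransportException(String message) { super(message); }
        TransportException(String message, Throwable cause) { super(message, cause); }
    }
    static class SendRequestTransportException extends TransportException {
        SendRequestTransportException(Throwable cause) { super("failed to send request", cause); }
    }

    /** Simplified classification: does this failure mean the node is shutting down? */
    static boolean nodeIsClosing(Throwable failure) {
        // the sender wraps the real error, so classify the cause, not the wrapper
        Throwable cause = failure.getCause() != null ? failure.getCause() : failure;
        return cause instanceof NodeClosedException
                || (cause instanceof TransportException
                        && String.valueOf(cause.getMessage()).startsWith("TransportService is closed"));
    }

    public static void main(String[] args) {
        Throwable wrapped = new SendRequestTransportException(
                new TransportException("TransportService is closed stopped can't send request"));
        // the wrapper's own message ("failed to send request") carries no shutdown hint;
        // only the unwrapped cause does
        System.out.println(nodeIsClosing(wrapped)); // true
    }
}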
dnhatn added a commit that referenced this pull request Mar 6, 2019
Today when a replicated write operation fails to execute on a replica,
the primary will reach out to the master to fail that replica (and mark
it stale). We then won't ack that request until the master removes the
failing replica; otherwise, we will lose the acked operation if the
failed replica is still in the in-sync set. However, if a node with the
primary is shutting down, we might ack such request even though we are
unable to send a shard-failure request to the master. This happens
because we ignore NodeClosedException which is triggered when the
ClusterService is being closed.

Closes #39467
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Mar 7, 2019
* 6.7:
  Fix CCR HLRC docs
  Introduce forget follower API (elastic#39718)
  6.6.2 release notes.
  Update release notes for 6.7.0
  Add documentation for min_hash filter (elastic#39671)
  Unmute testIndividualActionsTimeout
  Unmute testFollowIndexAndCloseNode
  Use unwrapped cause to determine if node is closing (elastic#39723)
  Don’t ack if unable to remove failing replica (elastic#39584)
  Wipe Snapshots Before Indices in RestTests (elastic#39662) (elastic#39765)
  Bug fix for AnnotatedTextHighlighter (elastic#39525)
  Fix Snapshot BwC with Version 5.6.x (elastic#39737)
  Fix occasional SearchServiceTests failure (elastic#39697)
  Correct date in daterange-aggregation.asciidoc (elastic#39727)
  Add a note to purge the ingest-geoip plugin (elastic#39553)
dnhatn added a commit that referenced this pull request Mar 8, 2019
If TransportService is stopped before a shard-failure request is sent
but after the request is registered, TransportService notifies
ReplicationOperation with a TransportException whose error message is:
"transport stop, action: internal:cluster/shard/failure".

Relates #39584
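
For reference, a tiny self-contained sketch that recognizes either shutdown-time message quoted in this thread; the helper is hypothetical and only the two message strings come from the discussion above.

public final class TransportStopMessageSketch {

    // both messages indicate the TransportService was already shutting down when
    // the shard-failure request was attempted
    static boolean isShutdownMessage(String message) {
        return message.startsWith("TransportService is closed stopped can't send request")
                || message.startsWith("transport stop, action:");
    }

    public static void main(String[] args) {
        System.out.println(isShutdownMessage("transport stop, action: internal:cluster/shard/failure")); // true
        System.out.println(isShutdownMessage("TransportService is closed stopped can't send request"));  // true
    }
}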
Labels
blocker >bug :Distributed Indexing/Distributed v6.6.3 v6.7.0 v7.0.0-rc2 v7.2.0 v8.0.0-alpha1
Development

Successfully merging this pull request may close these issues.

FollowerFailOverIT.testFailOverOnFollower fails on CI
5 participants