node-left ... reason: disconnected triggered by earlier node-left event rather than network issue #67873
Pinging @elastic/es-distributed (Team:Distributed) |
I managed to catch another instance of this with better logging. It looks like we have a race in the coordination layer so I'm relabelling this. I suspect the network-layer issue I mentioned may also need addressing before this is fixed. Here's the breakdown:
1. The node is dropped from the cluster due to health check failures.
2. The node starts to rejoin the cluster while the removal publication is still in flight. NB we ensure that the transport service is connected to this node as soon as we receive the join request.
3. The removal publication completes, so the master applies its state and disconnects from the node.
4. The master now processes the join, informing the
5. However we're no longer connected to the node, so the first follower check fails immediately.
6. The node is dropped from the cluster again, this time with reason: disconnected.
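To make the ordering concrete, here is a small standalone toy sketch (simplified, hypothetical names only; these are not Elasticsearch's actual classes): the join handler records a connection eagerly, the removal publication's applier drops it, and the join is only processed afterwards, so the first follower check immediately sees a missing connection.

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Toy model of the ordering above; none of these names are real Elasticsearch classes.
public class JoinRemovalRaceSketch {
    // stand-in for the master's set of open transport connections
    private final Set<String> connectedNodes = ConcurrentHashMap.newKeySet();

    void onJoinRequestReceived(String node) {
        connectedNodes.add(node); // step 2: connect to the node as soon as the join request arrives
    }

    void applyRemovalPublication(String node) {
        connectedNodes.remove(node); // step 3: removal state applied, so the master disconnects
    }

    void processJoin(String node) {
        // steps 4-5: the join is only processed after the disconnect, so the first follower check fails at once
        if (!connectedNodes.contains(node)) {
            System.out.println("node-left [" + node + "] reason: disconnected (no real network issue)");
        }
    }

    public static void main(String[] args) {
        JoinRemovalRaceSketch master = new JoinRemovalRaceSketch();
        master.onJoinRequestReceived("data-node");   // join arrives while the removal publication is in flight
        master.applyRemovalPublication("data-node"); // removal publication completes
        master.processJoin("data-node");             // join processed too late -> immediate "disconnected"
    }
}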
|
Hi @DaveCTurner, |
No, sorry, there's definitely a bug here but it's in a very tricky area and the fix is very low priority: you can only hit it if your cluster is already broken and the only effect in real terms is a bit of extra logging. A PR would be welcome, if it's important to you. |
Hi @DaveCTurner, we have also come across this issue with the same error messages. Can you please share the ES version in which you saw it? Is there any fix or workaround you have found? |
I imagine this affects all 7.x versions, or at least the non-EOL ones, since there haven't been any recent changes in this area. The only real effect is a bit of extra logging, so you can work around this by ignoring those messages. |
Hi @DaveCTurner, we seem to have come across a similar issue on 7.5.1. Would you please help to check whether it has the same root cause? One of the data nodes was removed by the master due to health check failures:
The removed data node's log:
After the above node was removed, it cannot rejoin the cluster any more; it continuously tries to discover the master, and even though it discovers the right list of master nodes, it cannot join:
And the master node continuously handles the node-join and node-removal tasks within a few seconds:
Only after we restart the data node manually does it rejoin successfully. |
No, this issue is only transient, it doesn't explain a persistent failure to rejoin the cluster. |
@DaveCTurner thanks. Any idea why the data node could discover all the dedicated master nodes and still could not join the cluster? The received cluster state also contains the complete set of master nodes. The node had previously been removed due to a high disk I/O utilisation issue; after the node was removed by the master, I/O utilisation returned to normal. It should not be a network issue, since manually restarting the node lets it join successfully. |
I don't have an idea how this might happen, but I certainly would like to. It sounds like a bug somewhere, maybe in Elasticsearch. I have seen other cases where a restart fixed a recurring disconnection that was caused by a badly-behaved router on the network between nodes. I should point out that you're using a version that's well past EOL - I don't think I have a working development environment for this version any more, so you'd need to reproduce it in a supported version first. If you can then please open a separate bug report and we'll talk about how to diagnose the problems further. |
The following process will match some cases:
Could the master cancel the corresponding ongoing scheduled checker when removing the node from followerCheckers?
|
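A rough sketch of the cancellation idea in the previous comment, using plain JDK scheduling as a stand-in (the map and method names here are hypothetical, not the real FollowersChecker API): the master keeps the scheduled handle for each follower and cancels it at removal time, so no stale check can fire against a connection that is about to be closed.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

class FollowerCheckersSketch {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    // hypothetical stand-in for the master's followerCheckers map
    private final Map<String, ScheduledFuture<?>> followerCheckers = new ConcurrentHashMap<>();

    void addFollower(String node) {
        followerCheckers.put(node, scheduler.scheduleWithFixedDelay(
            () -> System.out.println("follower check -> " + node), 0, 1, TimeUnit.SECONDS));
    }

    void removeFollower(String node) {
        ScheduledFuture<?> pendingCheck = followerCheckers.remove(node);
        if (pendingCheck != null) {
            // cancel the scheduled check instead of letting it fail later against a closed connection
            pendingCheck.cancel(false);
        }
    }
}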
Yes, what I explained above doesn't get to the point. In production we use version 7.9.1, and the problem lasts for a long time rather than being transient, as @howardhuanghua describes. As @DaveCTurner describes in detail, the reason why the master continuously handles the node joining and removing requests is that the master node is removing the data node from the cluster while the removed data node is waiting to rejoin the cluster; we should avoid this situation. The removed data node also prints logs such as:
The data node fails to join the cluster because of a timeout, but the master still considers the join request valid and sends a join validation request to the data node. The data node receives the check request and checks as follows: elasticsearch/server/src/main/java/org/elasticsearch/cluster/coordination/JoinHelper.java Line 131 in 87ca6fa
The check items are simple, so whether or not the data node is in the cluster it will quickly respond to the master. That is, the data node has given up joining the cluster, but the master accepts the join request and will publish |
This appears to be unrelated to the problem described in this issue. You seem to be saying that joins continuously encounter a timeout, but this issue is about nodes disconnecting immediately during the joining process. In addition the join timeout is ignored since 7.10.0 (#60872) so I don't think it's possible to see the effect you describe any more. |
More detailed logs from the master are as follows:
We can clearly see that, logically, the connection to the data node should have been established in elasticsearch/server/src/main/java/org/elasticsearch/cluster/coordination/Coordinator.java Line 448 in 87ca6fa
First, why does the master throw NodeNotConnectedException while processing the join? Logically the connection should exist. In fact:
Second, why does the master continuously handle the node joining and removing requests? The key point is that the master frequently receives join requests from the data node and doesn't check whether those join requests are still valid; the ignored join timeout doesn't seem to help here. What bothers me is why the join requests time out. |
You're using 7.9; the timeout was removed in 7.10. |
I mean, in 7.9, why do so many join requests time out? I will continue to investigate the reason. Removing the timeout in v7.10 only ensures that the join request is valid; even in v7.10 there will still be many join requests from the same data node, so this situation may still exist. Should we add a reference counter to the connection to avoid mistakenly closing it? |
I don't think this is the case: with the timeout removed there is only ever one join request (to each master) in flight at once, since we wait for each one to complete before sending another. I think this means we don't get into this strange state of repeatedly trying to join the master while the master is concurrently disconnecting from us. I mean, there's something else wrong in your system if the master can't process join requests within 60s for an extended period of time, but at least with this change it doesn't loop like you describe any more; it simply waits for the master to catch up. |
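A minimal toy sketch of that behaviour (hypothetical names, not Elasticsearch code): each join attempt blocks until the previous one completes, so only one join request is ever in flight and the node simply retries once it hears back, however slow the master is.

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ThreadLocalRandom;

public class SerializedJoinSketch {
    // stand-in for sending a join request to the current master and waiting for its response
    static CompletableFuture<Boolean> sendJoin() {
        return CompletableFuture.supplyAsync(() -> ThreadLocalRandom.current().nextInt(3) == 0);
    }

    public static void main(String[] args) {
        int attempt = 0;
        boolean joined = false;
        while (!joined) {
            attempt++;
            // join() blocks until this attempt completes, so at most one join request is ever in flight
            joined = sendJoin().join();
            System.out.println("join attempt " + attempt + (joined ? " succeeded" : " failed, retrying"));
        }
    }
}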
IIRC the reason we removed the timeout was because we saw this loop happening and realised that the timeout (plus an egregiously-overloaded master) was causing it. You should be able to get pretty much the same effect in 7.9 by setting the join timeout to something very long (e.g.
Yeah that was the fix I was thinking of here too, but note that it would merely hide the slow-joins problem in your cluster. |
There is indeed only one ongoing request at a time; I got it wrong at first. The case happened again this afternoon and lasted for 10 minutes. I changed the log level to
We can see that the data node does indeed send join requests to the master frequently over a continuous period of time (the join requests received by the master also confirm this), and we have reason to believe that in every join request the following process is successful: there really are that many valid join requests in a short time. The timeout (60s) is not the main reason; the frequency of occurrence is low. Removing the timeout is not enough; it seems that we should add a reference counter to the connection on the master to avoid this. If you want more detailed logs to verify, I am very happy to provide them. |
Hmm this does sound suspicious. Could you share the complete logs for the master node and the data node covering the whole outage? |
As more than one data node was being removed from the cluster at the same time, and the corresponding log level is
Data node log:
If the logs above don't help you, and you know where I can upload the log file, please tell me. |
The process is reasonable: The process keeps looping. |
Ahh ok I think I see now, thanks. That does look like this problem now. I am surprised to see it looping like this (without a timeout) but it looks like things are happening in quite a different order in your system from the order I thought we enforced. Turns out we aren't enforcing it, we're just relying on cluster state updates happening quicker than they do for you. I've had a quick look at implementing refcounting to prevent this. We don't actually have anything appropriate to refcount today - we can't prevent the |
This test seems to reproduce the loop fairly reliably:
diff --git a/server/src/internalClusterTest/java/org/elasticsearch/discovery/ClusterDisruptionIT.java b/server/src/internalClusterTest/java/org/elasticsearch/discovery/ClusterDisruptionIT.java
index 042d2e91089..df05b9f1084 100644
--- a/server/src/internalClusterTest/java/org/elasticsearch/discovery/ClusterDisruptionIT.java
+++ b/server/src/internalClusterTest/java/org/elasticsearch/discovery/ClusterDisruptionIT.java
@@ -16,19 +16,22 @@ import org.elasticsearch.action.NoShardAvailableActionException;
import org.elasticsearch.action.get.GetResponse;
import org.elasticsearch.action.index.IndexRequestBuilder;
import org.elasticsearch.action.index.IndexResponse;
+import org.elasticsearch.action.support.PlainActionFuture;
import org.elasticsearch.client.Client;
import org.elasticsearch.cluster.ClusterState;
import org.elasticsearch.cluster.action.shard.ShardStateAction;
import org.elasticsearch.cluster.coordination.ClusterBootstrapService;
+import org.elasticsearch.cluster.coordination.FollowersChecker;
import org.elasticsearch.cluster.coordination.LagDetector;
import org.elasticsearch.cluster.metadata.IndexMetadata;
import org.elasticsearch.cluster.routing.Murmur3HashFunction;
import org.elasticsearch.cluster.routing.ShardRouting;
import org.elasticsearch.cluster.routing.ShardRoutingState;
+import org.elasticsearch.cluster.service.ClusterService;
import org.elasticsearch.common.settings.Settings;
-import org.elasticsearch.core.TimeValue;
import org.elasticsearch.common.util.concurrent.ConcurrentCollections;
import org.elasticsearch.common.xcontent.XContentType;
+import org.elasticsearch.core.TimeValue;
import org.elasticsearch.index.VersionType;
import org.elasticsearch.index.shard.IndexShard;
import org.elasticsearch.index.shard.IndexShardTestCase;
@@ -40,6 +43,8 @@ import org.elasticsearch.test.disruption.NetworkDisruption.Bridge;
import org.elasticsearch.test.disruption.NetworkDisruption.TwoPartitions;
import org.elasticsearch.test.disruption.ServiceDisruptionScheme;
import org.elasticsearch.test.junit.annotations.TestIssueLogging;
+import org.elasticsearch.test.transport.MockTransportService;
+import org.elasticsearch.transport.TransportService;
import java.util.ArrayList;
import java.util.Collections;
@@ -59,6 +64,9 @@ import java.util.stream.IntStream;
import static org.elasticsearch.action.DocWriteResponse.Result.CREATED;
import static org.elasticsearch.action.DocWriteResponse.Result.UPDATED;
+import static org.elasticsearch.cluster.coordination.LeaderChecker.LEADER_CHECK_INTERVAL_SETTING;
+import static org.elasticsearch.cluster.coordination.LeaderChecker.LEADER_CHECK_RETRY_COUNT_SETTING;
+import static org.elasticsearch.discovery.PeerFinder.DISCOVERY_FIND_PEERS_INTERVAL_SETTING;
import static org.elasticsearch.test.hamcrest.ElasticsearchAssertions.assertAcked;
import static org.hamcrest.Matchers.equalTo;
import static org.hamcrest.Matchers.everyItem;
@@ -494,4 +502,60 @@ public class ClusterDisruptionIT extends AbstractDisruptionTestCase {
}
}
+ public void testRejoinWhileBeingRemoved() {
+ final String masterNode = internalCluster().startMasterOnlyNode();
+ final String dataNode = internalCluster().startDataOnlyNode(Settings.builder()
+ .put(DISCOVERY_FIND_PEERS_INTERVAL_SETTING.getKey(), "100ms")
+ .put(LEADER_CHECK_INTERVAL_SETTING.getKey(), "100ms")
+ .put(LEADER_CHECK_RETRY_COUNT_SETTING.getKey(), "1")
+ .build());
+
+ final ClusterService masterClusterService = internalCluster().getInstance(ClusterService.class, masterNode);
+ final PlainActionFuture<Void> removedNode = new PlainActionFuture<>();
+ masterClusterService.addListener(clusterChangedEvent -> {
+ if (removedNode.isDone() == false && clusterChangedEvent.state().nodes().getDataNodes().isEmpty()) {
+ removedNode.onResponse(null);
+ }
+ });
+
+ final ClusterService dataClusterService = internalCluster().getInstance(ClusterService.class, dataNode);
+ final PlainActionFuture<Void> failedLeader = new PlainActionFuture<>() {
+ @Override
+ protected boolean blockingAllowed() {
+ // we're deliberately blocking the cluster applier on the master until the data node starts to rejoin
+ return true;
+ }
+ };
+ final AtomicBoolean dataNodeHasMaster = new AtomicBoolean(true);
+ dataClusterService.addListener(clusterChangedEvent -> {
+ dataNodeHasMaster.set(clusterChangedEvent.state().nodes().getMasterNode() != null);
+ if (failedLeader.isDone() == false && dataNodeHasMaster.get() == false) {
+ failedLeader.onResponse(null);
+ }
+ });
+
+ masterClusterService.addHighPriorityApplier(event -> {
+ failedLeader.actionGet();
+ if (dataNodeHasMaster.get() == false) {
+ try {
+ Thread.sleep(100);
+ } catch (InterruptedException e) {
+ throw new AssertionError("unexpected", e);
+ }
+ }
+ });
+
+ final MockTransportService dataTransportService
+ = (MockTransportService) internalCluster().getInstance(TransportService.class, dataNode);
+ dataTransportService.addRequestHandlingBehavior(FollowersChecker.FOLLOWER_CHECK_ACTION_NAME, (handler, request, channel, task) -> {
+ if (removedNode.isDone() == false) {
+ channel.sendResponse(new ElasticsearchException("simulated check failure"));
+ } else {
+ handler.messageReceived(request, channel, task);
+ }
+ });
+
+ removedNode.actionGet(10, TimeUnit.SECONDS);
+ ensureStableCluster(2);
+ }
} |
Yes, I'm trying to fix it. As long as the IO pressure is relatively high it may happen, usually once every few days; the temporary workaround is to reduce the IO pressure by adding data nodes.
If the join requests always come first, rejecting joins doesn't seem to work in this case, unless the
It seems that adding refcounting in |
@DaveCTurner, have you started working on it? If not, it would be my pleasure to try to fix it; I will try adding refcounting. |
I have been exploring a few ideas; it is definitely possible to block joins while a disconnection is pending (either reject them or just wait for the disconnection to complete) but it's not very pretty. I think I can see a way to do it with refcounting now: it looks possible to make
A colleague points out that this will change the behaviour in the case that something starts dropping packets on a PING connection: we won't close the connection any more, we'll keep on trying to re-use it until the TCP retransmission timeout kicks in, which defaults to 15 minutes. That's the sort of thing we see sometimes with badly-configured firewalls, but we could solve it by verifying the PING connection while validating the join. It's unlikely I'll have time to dig deeper before the middle of next week now; if you think you'll make progress in the next few days then please go ahead, and hopefully we can collaborate on a PR. |
It's my honor to cooperate with you, and thank you very much for your ideas; I will try to fix it following your ideas. |
@DaveCTurner, I just tried to solve it in a way that is different from what you said, and it may not meet your requirements, so I didn't add a unit test. If you can provide suggestions I would be happy to fix it, or if there is a better way to implement it, I'd be glad to see it. |
Today it's possible to open two connections to a node, and then we notice when registering the second connection and close it instead. Fixing #67873 will require us to keep tighter control over the identity and lifecycle of each connection, and opening redundant connections gets in the way of this. This commit adds a check for an existing connection _after_ marking the connection as pending, which guarantees that we don't open those redundant connections.
Today we open connections to other nodes in various places and largely assume that they remain open as needed, only closing them when applying a cluster state that removes the remote node from the cluster. This isn't ideal: we might preserve unnecessary connections to remote nodes that aren't in the cluster if they never manage to join the cluster, and we might also disconnect from a node that left the cluster while it's in the process of re-joining too (see #67873). With this commit we move to a model in which each user of a connection to a remote node acquires a reference to the connection that must be released once it's no longer needed. Connections remain open while there are any live references, but are now actively closed when all references are released. Fixes #67873
Today we open connections to other nodes in various places and largely assume that they remain open as needed, only closing them when applying a cluster state that removes the remote node from the cluster. This isn't ideal: we might preserve unnecessary connections to remote nodes that aren't in the cluster if they never manage to join the cluster, and we might also disconnect from a node that left the cluster while it's in the process of re-joining too (see #67873). With this commit we move to a model in which each user of a connection to a remote node acquires a reference to the connection that must be released once it's no longer needed. Connections remain open while there are any live references, but are now actively closed when all references are released. Fixes #67873 Backport of #77295
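A much-simplified sketch of the acquire/release model described above (illustrative only, not the actual implementation in #77295; class and method names are made up): every user of a connection holds a reference, and the channel is only really closed once the last reference is released, so applying a node-removal cluster state can no longer pull the connection out from under an in-flight join.

import java.util.concurrent.atomic.AtomicInteger;

class RefCountedConnectionSketch implements AutoCloseable {
    private final String node;
    private final AtomicInteger refCount = new AtomicInteger(1); // the opener holds the first reference

    RefCountedConnectionSketch(String node) {
        this.node = node;
        System.out.println("opened connection to " + node);
    }

    RefCountedConnectionSketch acquire() {
        if (refCount.getAndIncrement() <= 0) {
            throw new IllegalStateException("connection to " + node + " already closed");
        }
        return this;
    }

    @Override
    public void close() { // releases one reference; the channel really closes only when the last one goes
        if (refCount.decrementAndGet() == 0) {
            System.out.println("last reference released, closing connection to " + node);
        }
    }

    public static void main(String[] args) {
        RefCountedConnectionSketch conn = new RefCountedConnectionSketch("data-node");
        try (RefCountedConnectionSketch forJoin = conn.acquire()) {
            // an in-flight join holds a reference, so applying a node-removal state cannot close the channel
            conn.close(); // the cluster-state applier releases its reference here
            System.out.println("join still holds a live reference, channel stays open");
        } // join finished: last reference released, connection actually closes now
    }
}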
Hi @DaveCTurner, in the data node log that @kkewwei shared, we can see the data node continuously sends join requests to the master node. It seems its
Normally, if a node cannot find the master it becomes a candidate. Do you have any idea why the above data node continuously sends join requests? |
Yes, it's because the master is repeatedly dropping the connection to the data node. The data node gets told "yes you have joined" but it never receives the cluster state update that adds it to the cluster, so it tries again. |
Hi @DaveCTurner, a simple question on which we need your help: why do we set most of the coordination layer parameters as |
Making settings dynamic adds substantial complexity. Moreover we strongly recommend not adjusting the settings you mention at all, let alone adjusting them dynamically. If the defaults don't work for you then there's something else wrong in your system, maybe a bug or maybe something environmental. We made these settings so that we had a way to implement a temporary workaround for a bug if one were found, but you shouldn't have them set permanently, let alone need to adjust them dynamically. Let's investigate the underlying problem you're trying to solve by adjusting these settings, either in a new GitHub issue or in a support case. |
@DaveCTurner, thanks for the patient reply. Since in this case we tried to increase
That sounds right to me, and yes we will not be backporting #77672 to versions before 7.16.0. Just to emphasise that we are still very interested in understanding the underlying slowness in your clusters. We have been profiling some of our own high-shard-count clusters recently and have found a number of places that deserve optimisation (you may have noticed a few PRs on the subject). We're very receptive to any insights you can share from your own profiling. |
Thanks @DaveCTurner. In this case, the main issue is that large/heavy search requests caused resource bottlenecks (IO/CPU), which then caused some of the nodes to leave temporarily (e.g. due to follower lagging). The cluster would then get into the above leave/join loop for a long time. On the other hand, we are optimizing cluster state update performance for high-shard-count clusters, as in PR #77266; it's only one piece of the whole optimization effort. Would you please list the related shard scalability optimizations that you are working on? We could cooperate further in the future. |
For instance we'd like to understand why you are seeing followers leaving due to lagging at all - that means it took over 2 minutes to process a cluster state, which is a very long time even on a heavily loaded node. Master-ineligible nodes should be able to process a new cluster state without doing any IO at all, and shouldn't need much CPU either, so if you are seeing resource bottlenecks delaying these things for >2 minutes then that sounds like a bug. There's no overarching list of tasks, we are just profiling various scenarios and addressing any unnecessary bottlenecks that we find. |
We've previously seen IO utilisation at 100% for several minutes on one of our public cloud ES cluster nodes; that's a cold node with HDDs. The ES version is 7.5, and node cluster state persistence would call
Would you please share these related PRs? |
If a node leaves the cluster with reason: disconnected then this means something closed one of the TCP channels initiated by the master targeting the departing node. Nodes never deliberately close any incoming connections, and the master doesn't spontaneously close any outgoing connections, so we usually consider reason: disconnected to be evidence of a network issue.

However, I recently observed a node-left event with reason: disconnected on a cluster running entirely on localhost, which therefore rules out a network issue. This was the second of two node-left events for the same node within a few seconds; the first had reason followers check retry count exceeded. I have so far failed to reproduce this situation.

My best guess regarding the cause is that we hadn't closed everything by the time the node rejoins, so the node-join task executes and only then does the channel get closed and the master notified that this node disconnected. (io.netty.channel.AbstractChannelHandlerContext#close() is fire-and-forget, it just adds a task to be processed by the event loop, but that particular event loop was blocked so the closing was delayed.)

I don't think the behaviour is particularly bad, the node already left the cluster, but the bug is in how we report this as a network event when really the network is fine.
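The delayed-close theory can be illustrated in miniature with a plain single-threaded executor standing in for the Netty event loop (toy code; nothing below is Netty or Elasticsearch API): the close just enqueues a task and returns, so while the event loop is blocked the disconnect notification is deferred, and by the time it fires the node may already be rejoining.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class DelayedCloseSketch {
    public static void main(String[] args) throws Exception {
        ExecutorService eventLoop = Executors.newSingleThreadExecutor(); // stand-in for the Netty event loop

        eventLoop.submit(() -> {
            try {
                Thread.sleep(2000); // the event loop is blocked on some long-running task
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        long start = System.nanoTime();
        // fire-and-forget "close": just enqueues a task on the event loop and returns immediately
        eventLoop.submit(() -> System.out.printf(
            "channel closed, master notified of disconnect after %d ms%n",
            TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start)));
        System.out.println("close() returned immediately, but the channel is not closed yet");

        eventLoop.shutdown();
        eventLoop.awaitTermination(5, TimeUnit.SECONDS);
    }
}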