Introduce dedicated threadpool for establishing connections #29023

DaveCTurner · 2018-03-13T17:03:14Z

Today, we attempt to connect to nodes concurrently using the management threadpool:

elasticsearch/server/src/main/java/org/elasticsearch/cluster/NodeConnectionsService.java

Lines 94 to 115 in 46e16b6

    
    threadPool.executor(ThreadPool.Names.MANAGEMENT).execute(new AbstractRunnable() {
        @Override
        public void onFailure(Exception e) {
            // both errors and rejections are logged here. the service
            // will try again after `cluster.nodes.reconnect_interval` on all nodes but the current master.
            // On the master, node fault detection will remove these nodes from the cluster as their are not
            // connected. Note that it is very rare that we end up here on the master.
            logger.warn((Supplier<?>) () -> new ParameterizedMessage("failed to connect to {}", node), e);
        }

        @Override
        protected void doRun() throws Exception {
            try (Releasable ignored = nodeLocks.acquire(node)) {
                validateAndConnectIfNeeded(node);
            }
        }

        @Override
        public void onAfter() {
            latch.countDown();
        }
    });

Connection establishment can be time-consuming if the remote node is unresponsive, and the management threadpool is small and important, so saturating it with attempts to connect to unresponsive nodes is undesirable.

The suggested fix is to create a separate threadpool purely for establishing node-to-node connections. As such connections are mostly long-lived, the new-connection threadpool will mostly be idle, but after a network partition it would be good for each node to try to re-establish connections to its peers using a lot more concurrency than the management threadpool can support.
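
To make the proposal concrete, here is a minimal sketch of the idea using plain `java.util.concurrent` rather than Elasticsearch's `ThreadPool` API. The class `ConnectPoolSketch`, the method `connectAll`, and its parameters are hypothetical names chosen for illustration; none of this is taken from the Elasticsearch codebase.

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical sketch, not Elasticsearch code. */
class ConnectPoolSketch {

    /**
     * Runs connect.accept(node) for every node on a pool used only for
     * connection attempts, then waits until all attempts have finished
     * or the timeout elapses. Returns true if all attempts finished in time.
     */
    static boolean connectAll(List<String> nodes, Consumer<String> connect,
                              int poolSize, long timeoutMillis) {
        // Dedicated pool: a slow connection attempt ties up one of these
        // threads, not a thread that other management work depends on.
        ExecutorService connectPool = Executors.newFixedThreadPool(poolSize);
        CountDownLatch latch = new CountDownLatch(nodes.size());
        try {
            for (String node : nodes) {
                connectPool.execute(() -> {
                    try {
                        connect.accept(node);
                    } catch (RuntimeException e) {
                        // Log and move on; a later reconnect round can retry.
                        System.err.println("failed to connect to " + node + ": " + e);
                    } finally {
                        latch.countDown();
                    }
                });
            }
            return latch.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        } finally {
            connectPool.shutdownNow(); // interrupt any attempts that are still hanging
        }
    }
}
```

In this sketch the pool is created per call for simplicity. A long-lived pool that is mostly idle, as described above, would instead be created once and shared.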

Relates to #28920, in which cluster state application is blocked for multiple minutes, in part because of insufficient concurrency when attempting to connect to unresponsive peers.

elasticmachine · 2018-03-13T17:03:15Z

Pinging @elastic/es-distributed

Today we attempt to (re-)connect to our peers using the management threadpool. However, during a network partition there may be a large number of concurrent connection attempts. Connection attempts to partitioned nodes, or to nodes in containers that are no longer running, can hang until they time out, possibly blocking other reconnection attempts and other management activity for an extended period of time. Moreover, connecting to a peer is a relatively lightweight operation, so it is reasonable to attempt a lot of them in parallel. This change introduces a separate threadpool solely for connecting to peers. Fixes elastic#29023.

This is related to #29023. Additionally, at other points we have discussed a preference for no longer blocking threads unnecessarily while opening new node connections. This commit lays the groundwork for this by opening connections asynchronously at the transport level. We still block for now, but this work will make it possible to eventually remove all blocking on new connections from the TransportService and Transport.
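
The difference between opening a connection asynchronously and still blocking on the result can be sketched as follows. This is an illustration only, built on `CompletableFuture`; the `Transport` interface here is a hypothetical stand-in and does not mirror the signatures of Elasticsearch's actual `Transport` or `TransportService`.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.BiConsumer;

/** Hypothetical sketch, not Elasticsearch code. */
class AsyncConnectSketch {

    /** Stand-in transport: the future completes once the connection is open. */
    interface Transport {
        CompletableFuture<String> openConnection(String node);
    }

    /**
     * Non-blocking variant: registers a listener and returns immediately.
     * No thread is parked while the remote node is slow to respond.
     */
    static void connectAsync(Transport transport, String node,
                             BiConsumer<String, Throwable> listener) {
        transport.openConnection(node).whenComplete(listener);
    }

    /**
     * Blocking variant layered on the async one: the caller waits for the
     * result, so a thread is still held until the attempt finishes.
     */
    static String connectBlocking(Transport transport, String node) {
        return transport.openConnection(node).join();
    }
}
```

Once every caller can use a listener in the style of `connectAsync`, the blocking wrapper is no longer needed, which is the direction this commit message describes.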

danielmitterdorfer · 2018-12-05T14:44:47Z

The work in #35144 makes it possible to open connections asynchronously, as an alternative to a dedicated threadpool. There is therefore no longer any need for a dedicated threadpool, and we would only introduce one if we ran into unexpected issues with the async approach. For the time being I am closing this issue.

DaveCTurner added help wanted, adoptme, :Distributed Indexing/Recovery labels Mar 13, 2018

DaveCTurner added v7.0.0 v6.3.0 labels Mar 13, 2018

DaveCTurner mentioned this issue Mar 13, 2018

Slow recovery of write availability after partition of a large cluster #28920

Closed

colings86 added the >enhancement label Apr 24, 2018

PnPie mentioned this issue Apr 25, 2018

Add a dedicated threadpool for node connections #30150

Closed

bleskes added v6.3.1 v6.4.0 and removed v6.3.0 v6.3.1 labels Apr 26, 2018

DaveCTurner mentioned this issue Jun 24, 2018

Introduce CONNECT threadpool #31546

Closed

lcawl added v6.4.1 and removed v6.4.0 labels Aug 23, 2018

Tim-Brooks mentioned this issue Oct 31, 2018

Open node connections asynchronously #35144

Merged

Tim-Brooks mentioned this issue Nov 7, 2018

Open node connections asynchronously #35343

Merged

danielmitterdorfer closed this as completed Dec 5, 2018

danielmitterdorfer removed the help wanted adoptme label Dec 5, 2018

colings86 added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019

DaveCTurner added the :Distributed Coordination/Network label Mar 12, 2019

DaveCTurner removed the :Distributed Indexing/Recovery label Mar 12, 2019

DaveCTurner mentioned this issue Mar 18, 2019

Avoid blocking a thread waiting for connections #40150

Closed

6 tasks
