Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce dedicated threadpool for establishing connections #29023

Closed
DaveCTurner opened this issue Mar 13, 2018 · 2 comments
Closed

Introduce dedicated threadpool for establishing connections #29023

DaveCTurner opened this issue Mar 13, 2018 · 2 comments
Labels

Comments

@DaveCTurner
Copy link
Contributor

Today, we attempt to connect to nodes concurrently using the management threadpool:

threadPool.executor(ThreadPool.Names.MANAGEMENT).execute(new AbstractRunnable() {
@Override
public void onFailure(Exception e) {
// both errors and rejections are logged here. the service
// will try again after `cluster.nodes.reconnect_interval` on all nodes but the current master.
// On the master, node fault detection will remove these nodes from the cluster as their are not
// connected. Note that it is very rare that we end up here on the master.
logger.warn((Supplier<?>) () -> new ParameterizedMessage("failed to connect to {}", node), e);
}
@Override
protected void doRun() throws Exception {
try (Releasable ignored = nodeLocks.acquire(node)) {
validateAndConnectIfNeeded(node);
}
}
@Override
public void onAfter() {
latch.countDown();
}
});

Connection establishment can be time-consuming if the remote node is unresponsive, and the management threadpool is small and important, so saturating it with attempts to connect to unresponsive nodes is undesirable.

The suggested fix is to create a separate threadpool purely for establishing node-to-node connections instead. As such connections are mostly long-lived the new-connection threadpool will mostly be idle, but after a network partition it would be good for each node to try and re-establish connections to its peers using a lot more concurrency than the management threadpool can support.

Relates #28920 in which cluster state application is blocked for multiple minutes because, in part, of insufficient concurrency when attempting to connect to unresponsive peers.

@DaveCTurner DaveCTurner added help wanted adoptme :Distributed Indexing/Recovery Anything around constructing a new shard, either from a local or a remote source. labels Mar 13, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Jun 24, 2018
Today we attempt to (re-)connect to our peers using the management threadpool.
However, during a network partition there may sometimes be a large number of
concurrent connection attempts. Connection attempts to partitioned nodes or to
nodes in containers that are no longer running can hang until they timeout,
possibly blocking other reconnection attempts and other management activity for
an extended period of time.  Moreover, connecting to a peer is a relatively
lightweight operation so it is reasonable to attempt a lot of them in parallel.

This change introduces a separate threadpool solely for connecting to peers.

Fixes elastic#29023.
@lcawl lcawl added v6.4.1 and removed v6.4.0 labels Aug 23, 2018
Tim-Brooks added a commit that referenced this issue Nov 7, 2018
This is related to #29023. Additionally at other points we have
discussed a preference for removing the need to unnecessarily block
threads for opening new node connections. This commit lays the groudwork
for this by opening connections asynchronously at the transport level.
We still block, however, this work will make it possible to eventually
remove all blocking on new connections out of the TransportService
and Transport.
Tim-Brooks added a commit to Tim-Brooks/elasticsearch that referenced this issue Nov 7, 2018
This is related to elastic#29023. Additionally at other points we have
discussed a preference for removing the need to unnecessarily block
threads for opening new node connections. This commit lays the groudwork
for this by opening connections asynchronously at the transport level.
We still block, however, this work will make it possible to eventually
remove all blocking on new connections out of the TransportService
and Transport.
Tim-Brooks added a commit that referenced this issue Nov 7, 2018
This is related to #29023. Additionally at other points we have
discussed a preference for removing the need to unnecessarily block
threads for opening new node connections. This commit lays the groudwork
for this by opening connections asynchronously at the transport level.
We still block, however, this work will make it possible to eventually
remove all blocking on new connections out of the TransportService
and Transport.
pgomulka pushed a commit to pgomulka/elasticsearch that referenced this issue Nov 13, 2018
This is related to elastic#29023. Additionally at other points we have
discussed a preference for removing the need to unnecessarily block
threads for opening new node connections. This commit lays the groudwork
for this by opening connections asynchronously at the transport level.
We still block, however, this work will make it possible to eventually
remove all blocking on new connections out of the TransportService
and Transport.
@danielmitterdorfer
Copy link
Member

The work in #35144 is an enabler to open connections asynchronously as an alternative to a dedicated threadpool. Therefore there is no need for a dedicated threadpool anymore and we might only introduce one should we run into unexpected issues with the async approach but for the time being I am closing this issue.

@danielmitterdorfer danielmitterdorfer removed the help wanted adoptme label Dec 5, 2018
@DaveCTurner DaveCTurner added the :Distributed Coordination/Network Http and internode communication implementations label Mar 12, 2019
@DaveCTurner DaveCTurner removed the :Distributed Indexing/Recovery Anything around constructing a new shard, either from a local or a remote source. label Mar 12, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

6 participants