Connect to new nodes concurrently #22984
Conversation
I left one question... lgtm otherwise
```diff
     // skip
 }

 @Override
-public void disconnectFromNodesExcept(Iterable<DiscoveryNode> nodesToKeep) {
+public void disconnectFromNodesExcept(DiscoveryNodes nodesToKeep) {
```
why do we change these? Isn't Iterable more generic and would still accept DiscoveryNodes? Also, disconnectFromNodesExcept sounds like it would only be executed on a subset of an actual DiscoveryNodes instance. I think we should stick with what we had?
> why do we change these? Isn't Iterable more generic and would still accept DiscoveryNodes?

I can change it back, but I did it to be consistent with connectToNodes, which needs the total size of the collection. Since this is only used with DiscoveryNodes, I felt consistency is clearer. As said, I'll roll it back if you prefer.

> disconnectFromNodesExcept sounds like it would only be executed on a subset of an actual DiscoveryNodes instance

Naming is hard. What it does is disconnect from all nodes that are not part of the supplied DiscoveryNodes parameter. We pass the DiscoveryNodes of the current cluster state directly, without any modification.
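To make those semantics concrete, here is a minimal, hypothetical sketch. The ConnectionPruner class name and the connectedNodes bookkeeping set are assumptions made for illustration, not the actual NodeConnectionsService code; the point is simply that we drop the connection to every node that is absent from the supplied DiscoveryNodes of the current cluster state.

```java
import org.elasticsearch.cluster.node.DiscoveryNode;
import org.elasticsearch.cluster.node.DiscoveryNodes;
import org.elasticsearch.transport.TransportService;

import java.util.Set;

class ConnectionPruner {
    private final TransportService transportService;
    private final Set<DiscoveryNode> connectedNodes; // assumed bookkeeping of currently connected nodes

    ConnectionPruner(TransportService transportService, Set<DiscoveryNode> connectedNodes) {
        this.transportService = transportService;
        this.connectedNodes = connectedNodes;
    }

    void disconnectFromNodesExcept(DiscoveryNodes nodesToKeep) {
        for (DiscoveryNode node : connectedNodes) {
            if (nodesToKeep.nodeExists(node) == false) { // node is no longer in the cluster state
                transportService.disconnectFromNode(node);
            }
        }
    }
}
```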
Fair enough, let's do it.
thx @s1monw
When a node receives a new cluster state from the master, it opens connections to any new node in the cluster state. That has always been done serially on the cluster state thread, but it has been a long-standing TODO to do this concurrently, which is what this PR does.
This is a spin-off of #22828, where an extra handshake is done whenever connecting to a node, which may slow down connecting. Also, the handshake is done in a blocking fashion, which triggers assertions about blocking requests on the cluster state thread. Instead of adding an exception to that assertion, I opted to implement concurrent connections, which both sidesteps the assertion and compensates for the extra handshake.
The change is well covered by the current tests. Sadly, I could not find an easy way to simulate rejections, as we have locked down the thread pool settings (good!) and the management pool still has an unbounded queue.
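For illustration, here is a rough sketch of the general pattern described above, not the PR's actual implementation: connection attempts are fanned out to an executor and the calling thread waits on a latch sized by DiscoveryNodes.getSize() until every attempt has completed. The ConcurrentConnector class name and the choice of executor are assumptions made for this example.

```java
import org.elasticsearch.cluster.node.DiscoveryNode;
import org.elasticsearch.cluster.node.DiscoveryNodes;
import org.elasticsearch.transport.TransportService;

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executor;

class ConcurrentConnector {
    private final TransportService transportService;
    private final Executor executor; // some shared thread pool; which pool the PR actually uses is not shown here

    ConcurrentConnector(TransportService transportService, Executor executor) {
        this.transportService = transportService;
        this.executor = executor;
    }

    void connectToNodes(DiscoveryNodes discoveryNodes) throws InterruptedException {
        final CountDownLatch latch = new CountDownLatch(discoveryNodes.getSize());
        for (final DiscoveryNode node : discoveryNodes) {
            executor.execute(() -> {
                try {
                    transportService.connectToNode(node); // includes the extra handshake introduced in #22828
                } catch (Exception e) {
                    // swallow here for brevity; a failed connection can be retried on the next cluster state
                } finally {
                    latch.countDown();
                }
            });
        }
        latch.await(); // return only once every connection attempt has finished
    }
}
```

Sizing the latch requires knowing the total number of nodes up front, which is the consistency argument made above for passing DiscoveryNodes rather than a plain Iterable.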