Fix Leaking Listener When Closing NodeClient #55676

original-brownbear · 2020-04-23T14:21:20Z

If a node client (or rather its underlying node) is closed then any executions on it will just quietly fail as happens in #55660 via closing the nodes on the test thread and asynchronously using
a node client.
I also removed the onRejection implementation on the error handling runnable since it's dead code (generic pool never rejects).
IMO, it's reasonable to handle the callback on the current thread when shutting down if only to fix tests and remove a dirty spot that loses listeners.
The only way you can really trigger a SO here is by rerunning requests in a hot loop which you shouldn't do in the first place. ES internally doesn't do that in its usages of the node client as far as I can see and we shouldn't do that in tests either (and don't as far as I can see).

Closes #55660

If a node client (or rather its underlying node) is closed then any executions on it will just quietly fail as happens in elastic#55660 via closing the nodes on the test thread and asynchroneously using a node client. Closes elastic#55660

elasticmachine · 2020-04-23T14:21:22Z

Pinging @elastic/es-distributed (:Distributed/Network)

ywelsch

I'm undecided whether we should do this change. The shut down behavior of ES / transport service makes sense as it is, but the integration tests (with the node client) are directly reaching into the node's innards to execute stuff, with requests in particular not being routed through the transport service (relates also #55494).

original-brownbear · 2020-04-24T11:53:58Z

@ywelsch annoyingly enough, the problem here is precisely the fact that this request goes through the TransportService for the local node (because it's a broadcast request) and that then never finishes.
In other tests, we're saved from this problem because most of the requests we do are TMA which will throw a NodeClosedException when running into a closed node.
I see the point that this is a test-only issue. On the other hand, all transport master node actions will actually throw the node closed exception on the current thread as well ... so you could argue that this change at least makes the NodeClient behavior consistent in that sense I guess?

We had this issue before (service does cleanup/exception-handling on the generic pool) in #46178 where we fixed it by waiting for all actions to finish when closing the affected service. This isn't really a fix here though because we simply have to do something about local requests to a shut-down transport service IMO.
The core of the issue IMO, is simply that a stopped transport service (if its TP is shut down as well) will simply quietly swallow all local requests' listeners. This seems wrong (even if the only practical problem is that it's very trappy in tests) and IMO my change is the only possible fix, you simply have to resolve the listener on the current thread in this case because you can't rely on the pool?

ywelsch · 2020-04-24T12:00:31Z

I was wondering if we should instead add something to the test infra in InternalTestCluster instead, so that we track all pending requests, and throw a NodeClosedException when the node is closed.

original-brownbear · 2020-04-24T12:06:15Z

We could do that via a FilterClient I think (just have it wrap all listeners with execute once and track them), that wouldn't be too hard either. I still find it weird though, that a closed transport service would become a black hole for listeners? I'm fine fixing it in the client though since it's a test only issue in practice :)

ywelsch · 2020-04-24T13:34:46Z

Let's see what @tbrooks8 thinks.

I still find it weird though, that a closed transport service would become a black hole for listeners?

It's not that a closed transport service is becoming a black hole for listeners when the transport service is shut down (we handle that correctly AFAICS), but that when the threadpools are shut down, there is no guarantee that all tasks on the thread pool (queued up or not) will have their listeners invoked. That's a guarantee we don't have today, and has nothing to do with the transport.

original-brownbear · 2020-04-24T14:06:04Z

we handle that correctly AFAICS

Not for local node requests. A stopped transport will simply will always get to org.elasticsearch.transport.TransportService#sendRequestInternal, that will throw NodeClosedException , which is then handled on the generic pool (i.e. never if the pool is shut down as well). For the other non-local connections we have some handling around closing them that is good enough (and executes not connected exception straight on the calling thread) but for a local node request we don't. Another alternative here would be to just add special case handling for local node requests and throw for those right when getting the connection if the transport is shut down to make local and remote requests behave the same. That should cover all cases as well.

To be honest, the fact that we handle some exceptions NodeNotConnected straight on the calling thread (https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/elasticsearch/transport/TransportService.java#L530), but pass others exceptions on to the generic pool seems pointless (if SO was an issue, then why isn't it for node not connected?).
=> since we never see any SO issues from node not connected (at least I haven't) I wonder if we shouldn't just get rid of the forking to the generic pool in the error handling here in all cases and keep things simple?

ywelsch

LGTM

If a node client (or rather its underlying node) is closed then any executions on it will just quietly fail as happens in elastic#55660 via closing the nodes on the test thread and asynchronously using a node client. Closes elastic#55660

If a node client (or rather its underlying node) is closed then any executions on it will just quietly fail as happens in #55660 via closing the nodes on the test thread and asynchronously using a node client. Closes #55660

Fix Leaking Listener When Closing NodeClient

721c55c

If a node client (or rather its underlying node) is closed then any executions on it will just quietly fail as happens in elastic#55660 via closing the nodes on the test thread and asynchroneously using a node client. Closes elastic#55660

original-brownbear added >non-issue :Distributed Coordination/Network Http and internode communication implementations v8.0.0 v7.8.0 labels Apr 23, 2020

original-brownbear mentioned this pull request Apr 23, 2020

[CI] org.elasticsearch.discovery.DiskDisruptionIT.testGlobalCheckpointIsSafe #55660

Closed

original-brownbear requested review from ywelsch and Tim-Brooks April 23, 2020 15:16

ywelsch reviewed Apr 24, 2020

View reviewed changes

Merge remote-tracking branch 'elastic/master' into 55660

e253019

ywelsch approved these changes Apr 28, 2020

View reviewed changes

original-brownbear added 2 commits April 28, 2020 14:53

Merge remote-tracking branch 'elastic/master' into 55660

8c48bfe

CR: bring back rejection handler

fc0e44d

original-brownbear merged commit d7bc3dd into elastic:master Apr 28, 2020

original-brownbear deleted the 55660 branch April 28, 2020 14:38

original-brownbear mentioned this pull request Apr 28, 2020

Fix Leaking Listener When Closing NodeClient (#55676) #55864

Merged

original-brownbear restored the 55660 branch August 6, 2020 18:22

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix Leaking Listener When Closing NodeClient #55676

Fix Leaking Listener When Closing NodeClient #55676

original-brownbear commented Apr 23, 2020

elasticmachine commented Apr 23, 2020

ywelsch left a comment

original-brownbear commented Apr 24, 2020

ywelsch commented Apr 24, 2020

original-brownbear commented Apr 24, 2020

ywelsch commented Apr 24, 2020

original-brownbear commented Apr 24, 2020

ywelsch left a comment

Fix Leaking Listener When Closing NodeClient #55676

Fix Leaking Listener When Closing NodeClient #55676

Conversation

original-brownbear commented Apr 23, 2020

elasticmachine commented Apr 23, 2020

ywelsch left a comment

Choose a reason for hiding this comment

original-brownbear commented Apr 24, 2020

ywelsch commented Apr 24, 2020

original-brownbear commented Apr 24, 2020

ywelsch commented Apr 24, 2020

original-brownbear commented Apr 24, 2020

ywelsch left a comment

Choose a reason for hiding this comment