Client-side stats collection timeouts can result in overloaded master #60188
Pinging @elastic/es-core-features (:Core/Features/Monitoring)
@DaveCTurner Thanks for looking into it. I think the cluster should be able to protect itself independently of whichever monitoring entity is involved, in which case timeouts rescue the coordinating node. We could additionally cancel the tasks on the slow node when a timeout fires, since we don't care about their responses beyond that point. Thoughts?
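One way to picture the cancel-on-timeout idea is the sketch below, assuming we reuse the existing Cancel Tasks API; the node ID and the per-node action name are illustrative assumptions, not what the monitoring code currently does, and this only helps where the underlying tasks are actually cancellable.

```java
import org.elasticsearch.action.ActionListener;
import org.elasticsearch.action.admin.cluster.node.tasks.cancel.CancelTasksRequest;
import org.elasticsearch.client.Client;

// Best-effort cleanup after a collection timeout: ask the slow node to cancel any
// still-running stats tasks, since their responses will be ignored anyway.
// "slowNodeId" and the per-node action name are illustrative assumptions.
void cancelStaleStatsTasks(Client client, String slowNodeId) {
    CancelTasksRequest cancel = new CancelTasksRequest()
            .setNodes(slowNodeId)                           // only the node that timed out
            .setActions("cluster:monitor/nodes/stats[n]");  // assumed per-node stats action name
    client.admin().cluster().cancelTasks(cancel, ActionListener.wrap(
            response -> { /* nothing to do; late responses are discarded regardless */ },
            e -> { /* best-effort: a failed cancellation is not fatal */ }));
}
```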
Thanks @DaveCTurner for raising this. We discussed this today and agree that this is an issue, but it may be a bit more complex than just removing the client-side timeout. Some context:

There is an effort to replace these "internal" collectors with Metricbeat-based collectors, which run out of process and have slightly different characteristics but produce the same data. Any large efforts should be geared towards ensuring that the Metricbeat approach is as effective as possible.

The "internal" collectors run on each node, and any collection that requires fanning out to the individual nodes is handled on the master node. I am not clear on why this decision was made as opposed to simply letting each node handle its own metrics, avoiding intra-cluster communication altogether. (I am sure there is a valid reason, but those decisions pre-date my time here.) To the best of my knowledge Metricbeat is designed to run as a sidecar process for each instance of ES and always (?? need to double check ??) uses the

I think if it were not for Metricbeat we should revisit the decision on why intra-cluster communication is needed at all for internal collection, and go from there. However, I think time might be better spent ensuring that the Metricbeat approach is optimal and can't result in memory/performance issues akin to the one mentioned here.

One reason that I am cautious (hesitant?) to remove the client-side timeouts is: what happens on the master node if it sends a request and never hears back? I don't think we have any implicit transport-layer timeouts, do we? Would we end up blocking indefinitely? Most of these actions are backed by
Indeed, the question for Metricbeat (and other HTTP clients) is covered in #55550. Today Metricbeat also times out after 10 seconds and immediately retries, and at least some of its collections run only on the elected master (!)

As I noted in #55550 my preference there would be to treat a closed HTTP connection as a cancellation, like we do for searches, since HTTP clients need an end-to-end timeout anyway and this would let us respect it.

There are a couple of differences between the internal monitoring client and an external HTTP client like Metricbeat that call for a different approach for the monitoring client. Firstly, there's no equivalent to "close the HTTP connection" to signal that the request is no longer needed. Secondly, transport requests always get an eventual response, even without an explicit per-request timeout (mod bugs ofc), because they react to the health of the underlying TCP connection (as determined by all the other requests on the same connection, or keepalives in the case of extremely low traffic). (That said, HTTP requests also always get a response in some form thanks to the same TCP-level mechanisms and the fact that the coordinating node always responds as long as the network is healthy (mod bugs ofc), assuming keepalives are configured properly, but that's a separate debate.)

I'm -0 on moving the timeout to

NB -0 means I have only a very slight preference for not doing it. The linked issues would certainly have been less severe if
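For reference, the "closed connection as cancellation" idea looks roughly like the following on the REST side. This is a conceptual sketch only: `sendOk`, `sendError` and `cancelTask` are hypothetical helpers, and the wiring does not mirror the actual REST handler code.

```java
import org.elasticsearch.action.ActionListener;
import org.elasticsearch.action.admin.cluster.stats.ClusterStatsAction;
import org.elasticsearch.action.admin.cluster.stats.ClusterStatsRequest;
import org.elasticsearch.client.node.NodeClient;
import org.elasticsearch.http.HttpChannel;
import org.elasticsearch.tasks.Task;

// Conceptual sketch: start the stats action as a local task and register a close
// listener on the HTTP channel. "sendOk", "sendError" and "cancelTask" are
// hypothetical helpers used only to keep the sketch short.
void handleCancellableStatsRequest(NodeClient client, HttpChannel channel) {
    Task task = client.executeLocally(ClusterStatsAction.INSTANCE, new ClusterStatsRequest(),
            ActionListener.wrap(response -> sendOk(channel, response), e -> sendError(channel, e)));

    // A closed connection is the only cancellation signal an HTTP client can send;
    // translate it into a task cancellation so per-node partial results stop accumulating.
    channel.addCloseListener(ActionListener.wrap(() -> cancelTask(client, task)));
}
```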
Adrien just pointed out something that I had missed earlier: if we simply remove these timeouts then that'll halt all the other collectors too, not just the one that's blocked on a slow node, because we run the collections in series, block the collecting thread to wait for each response in turn, and don't schedule the next collection until the previous cycle has completed 😢
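A simplified sketch of that serial cycle, with hypothetical names (the real Collector and MonitoringService signatures differ), just to show why one blocked collector stalls the rest:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the serial collection cycle described above; not the
// actual monitoring classes.
interface StatsCollector {
    // Blocks until the stats response arrives or the client-side timeout fires.
    List<Object> collect();
}

class CollectionCycle {
    void runOnce(List<StatsCollector> collectors) {
        List<Object> results = new ArrayList<>();
        for (StatsCollector collector : collectors) {
            // Collectors run one after another on the same thread: a collector stuck
            // waiting on one slow node stalls every collector after it.
            results.addAll(collector.collect());
        }
        // The results are exported and the next cycle is scheduled only after the
        // whole pass completes, so a stalled pass also delays future passes.
    }
}
```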
We have hit the same problem: after a data node was restarted and had recovered, the master node was unable to collect indices stats, cluster stats, and index recovery stats from that data node and eventually went OOM, even though iostat showed that network and disk I/O on the data node were both fine. Once a new master was elected the problem disappeared, without any change being made to the data node. Could there be a problem with the connection used for collecting stats? Has anyone else seen this scenario?
With #66993 there is now support for coordinator-side timeouts on a `BroadcastRequest`, which includes requests for node stats and recoveries. This commit adjusts Monitoring to use these coordinator-side timeouts where applicable, which will prevent partial stats responses from accumulating on the master while one or more nodes are not responding quickly enough. It also enhances the message logged on a timeout to include the IDs of the nodes which did not respond in time. Closes #60188.
Today the monitoring subsystem collects stats from a cluster with a client-side timeout, e.g.:
- `x-pack/plugin/monitoring/src/main/java/org/elasticsearch/xpack/monitoring/collector/cluster/ClusterStatsCollector.java`, line 87 in 8b7c556
- `x-pack/plugin/monitoring/src/main/java/org/elasticsearch/xpack/monitoring/collector/indices/IndexStatsCollector.java`, line 74 in 8b7c556
- `x-pack/plugin/monitoring/src/main/java/org/elasticsearch/xpack/monitoring/collector/indices/IndexRecoveryCollector.java`, line 73 in 8b7c556
- `x-pack/plugin/monitoring/src/main/java/org/elasticsearch/xpack/monitoring/collector/ml/JobStatsCollector.java`, line 76 in 8b7c556
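The calls at the lines referenced above follow roughly the pattern below (a paraphrase using the cluster-stats case, not the exact source): the collector issues the stats request and blocks on the response with a client-side timeout.

```java
import org.elasticsearch.action.admin.cluster.stats.ClusterStatsResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.unit.TimeValue;

// Paraphrase of the client-side-timeout pattern used by the collectors referenced
// above (cluster-stats case shown); not the exact source code.
ClusterStatsResponse collectClusterStats(Client client, TimeValue collectionTimeout) {
    return client.admin().cluster()
            .prepareClusterStats()
            .get(collectionTimeout);  // gives up waiting after the timeout; the server-side work continues
}
```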
This timeout is configurable for each collector and defaults to 10 seconds:
- `x-pack/plugin/monitoring/src/main/java/org/elasticsearch/xpack/monitoring/collector/Collector.java`, lines 174 to 177 in 8b7c556
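The referenced lines declare the per-collector timeout setting; an approximation (the exact helper in `Collector.java` may differ) is a dynamic, node-scoped time setting defaulting to 10 seconds:

```java
import org.elasticsearch.common.settings.Setting;
import org.elasticsearch.common.unit.TimeValue;

// Approximation of the per-collector timeout setting (exact names may differ):
// dynamic, node-scoped, defaulting to 10 seconds.
static Setting<TimeValue> collectionTimeoutSetting(String settingName) {
    return Setting.timeSetting("xpack.monitoring.collection." + settingName,
            TimeValue.timeValueSeconds(10),
            Setting.Property.Dynamic, Setting.Property.NodeScope);
}
```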
Handlers of stats requests generally reach out to all the nodes in the cluster, collect their responses, and once all nodes have responded they send a summary of the results to the originating client. These responses can be rather large, perhaps 10s of MBs, and this data all lives on-heap on the coordinating node until every node has responded.
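Schematically, that fan-out behaves like the sketch below (hypothetical names, not the actual transport action code): every per-node response stays on the coordinating node's heap until the last node answers, and only then is the summary built and sent.

```java
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

// Hypothetical sketch of the fan-out described above: per-node responses are
// buffered on the coordinating node's heap and released only once every node
// has answered.
class FanOutCollector<NodeResponse> {
    private final ConcurrentHashMap<String, NodeResponse> responses = new ConcurrentHashMap<>();
    private final AtomicInteger pending;
    private final Consumer<List<NodeResponse>> onComplete;

    FanOutCollector(int nodeCount, Consumer<List<NodeResponse>> onComplete) {
        this.pending = new AtomicInteger(nodeCount);
        this.onComplete = onComplete;
    }

    // Called once per node; each response may be several MB of stats.
    void onNodeResponse(String nodeId, NodeResponse response) {
        responses.put(nodeId, response);  // held on heap until the slowest node replies
        if (pending.decrementAndGet() == 0) {
            onComplete.accept(List.copyOf(responses.values()));  // summary sent only now
        }
    }
}
```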
The problem with the client-side timeouts that monitoring uses is that they do not clean up the partial results held on the coordinating node. If one node stops responding to stats requests for a while then monitoring retries every 10 seconds, and each retry adds another handler, and tens of MB more heap usage, to the coordinating node.
I think we should remove these client-side timeouts so that we avoid the accumulation of on-heap junk caused by these retries. If we feel that the timeout/retry behaviour is necessary then I think we should move it into the server so that it can clean up properly on a failure (relates #52616).
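For a sense of what the server-side variant looks like, the eventual fix (per the commit message above) relies on the coordinator-side timeout added in #66993. A heavily hedged sketch follows, where the setter name is an assumption taken from that commit message rather than a verified signature: the point is that the timeout travels with the request, so the coordinating node can fail the request and release the partial per-node results itself instead of the client simply abandoning them.

```java
import org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest;
import org.elasticsearch.common.unit.TimeValue;

// Hedged sketch only: the "setTimeout" setter name is an assumption based on the
// #66993 commit message, not a verified signature. The timeout is enforced by the
// coordinating node, which can clean up partial results when it expires.
IndicesStatsRequest statsRequestWithCoordinatorTimeout(TimeValue collectionTimeout) {
    IndicesStatsRequest request = new IndicesStatsRequest();
    request.setTimeout(collectionTimeout);
    return request;
}
```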