Add a Histogram of Transport Worker Time that is Spent per-Message #80428
Labels
:Distributed Coordination/Network
Http and internode communication implementations
>enhancement
Team:Distributed (Obsolete)
Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination.
We'd like to add information to `TransportStats` (and thus the node stats) that gives us insight into the performance of transport threads and whether they might get blocked for too long by a heavy task (like deserializing a large message).

This should be easy to build by tracking the existing timings recorded by slow-logging in `InboundHandler` and `OutboundHandler`. We do not need a very sophisticated histogram here. We effectively only care about recording the number of problematic messages and a rough idea of how long they take, so we can make do with a couple of fixed buckets to count timings into.

I would suggest we record the following separately for requests and responses (just powers of 2): `<2ms`, `<4ms`, `<8ms`, `<16ms` and so on up to `<65536ms` (and one more for everything longer than that). This gives us a good measure of how much time we spend on the transport thread for the trivial cost of 17 counters times two.

We can then add those numbers to the `TransportStats` message and its serialization into node stats.

We mainly need this for benchmarking in #77466, but this should be quite useful in debugging as well.
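
To make the bucket layout concrete, here is a rough sketch of what such a fixed-bucket histogram could look like. The class name `TransportTimeHistogram` and its methods are hypothetical and only illustrate the idea. This is not existing Elasticsearch code, and the wiring into `InboundHandler`, `OutboundHandler` and `TransportStats` is left out.

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch: name and API are assumptions, not existing Elasticsearch code.
final class TransportTimeHistogram {

    // 16 power-of-two buckets (<2ms, <4ms, ..., <65536ms) plus one overflow bucket.
    static final int BUCKET_COUNT = 17;

    // LongAdder keeps increments cheap when many transport threads record concurrently.
    private final LongAdder[] buckets = new LongAdder[BUCKET_COUNT];

    TransportTimeHistogram() {
        for (int i = 0; i < BUCKET_COUNT; i++) {
            buckets[i] = new LongAdder();
        }
    }

    // Maps a duration in milliseconds to its bucket:
    // index 0 is <2ms, index 1 is <4ms, ..., index 15 is <65536ms, index 16 is everything longer.
    static int bucketIndex(long millis) {
        if (millis < 2) {
            return 0; // also covers zero and (unexpected) negative values
        }
        // Position of the highest set bit, i.e. floor(log2(millis)).
        int highestBit = 63 - Long.numberOfLeadingZeros(millis);
        return Math.min(highestBit, BUCKET_COUNT - 1);
    }

    // Counts one message whose handling took the given number of milliseconds.
    void record(long millis) {
        buckets[bucketIndex(millis)].increment();
    }

    // Returns a point-in-time copy of the counters, e.g. for serialization into stats.
    long[] snapshot() {
        long[] counts = new long[BUCKET_COUNT];
        for (int i = 0; i < BUCKET_COUNT; i++) {
            counts[i] = buckets[i].sum();
        }
        return counts;
    }
}
```

Under this sketch, one instance would be kept for requests and one for responses, which matches the "17 counters times two" estimate above. Each handler would call `record(...)` with the timing it already measures for slow-logging.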
relates and asks for part of #36127
relates #77466