Add a Histogram of Transport Worker Time that is Spent per-Message #80428

Closed
Tracked by #77466
original-brownbear opened this issue Nov 5, 2021 · 1 comment · Fixed by #80581
Labels
:Distributed Coordination/Network Http and internode communication implementations >enhancement Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination.

Comments

@original-brownbear
Member

We'd like to add information to TransportStats (and thus the node stats) that gives us insight into the performance of transport threads and whether or not they might get blocked by a heavy task (like deserializing a large message) for too long.
This should be easy to build by tracking the existing timings recorded by slow-logging in InboundHandler and OutboundHandler. We do not need a very sophisticated histogram here. We effectively only care about recording the number of problematic messages and a rough idea of how long they take, so we can make do with a couple of fixed buckets to count timings into.
I would suggest we record the following separately for requests and responses (just powers of 2):
<2ms, <4ms, <8ms, <16ms and so on up to <65536ms (and one more for everything longer than that). This gives us a good measure of how much time we spend on the transport thread for the trivial cost of 17 counters times two.
We can then add those numbers to the TransportStats message and its serialization into node stats.
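
To make the bucket layout above concrete, here is a minimal sketch of a fixed power-of-2 histogram in Java. The class name `HandlingTimeHistogram` and its methods are hypothetical and chosen only for illustration; this is not the implementation that was eventually merged. It assumes timings arrive as whole milliseconds and that counters may be incremented from several transport threads at once.

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch: names and structure are assumptions, not the actual Elasticsearch code.
final class HandlingTimeHistogram {
    // 16 buckets for <2ms, <4ms, ..., <65536ms, plus one bucket for everything longer.
    static final int BUCKET_COUNT = 17;

    private final LongAdder[] buckets = new LongAdder[BUCKET_COUNT];

    HandlingTimeHistogram() {
        for (int i = 0; i < BUCKET_COUNT; i++) {
            buckets[i] = new LongAdder();
        }
    }

    // Maps a handling time to its bucket: bucket i (for i < 16) counts times below 2^(i+1) ms.
    static int bucketIndex(long millis) {
        if (millis < 2) {
            return 0; // covers 0ms, 1ms and (defensively) negative values
        }
        // For millis >= 2, floor(log2(millis)) is 63 minus the number of leading zero bits.
        int log2 = 63 - Long.numberOfLeadingZeros(millis);
        return Math.min(log2, BUCKET_COUNT - 1);
    }

    // Records one message's handling time; cheap enough to call on a transport thread.
    void record(long millis) {
        buckets[bucketIndex(millis)].increment();
    }

    // Returns a point-in-time copy of the 17 counters, e.g. for reporting in stats.
    long[] snapshot() {
        long[] counts = new long[BUCKET_COUNT];
        for (int i = 0; i < BUCKET_COUNT; i++) {
            counts[i] = buckets[i].sum();
        }
        return counts;
    }
}
```

Under this sketch, one instance would be kept for requests and one for responses, which gives the "17 counters times two" mentioned above. `LongAdder` is used here because it keeps contention low when many threads increment the same counter; a plain `AtomicLong` array would also work.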

We mainly need this for benchmarking in #77466 but this should be quite useful in debugging as well.

relates and asks for part of #36127
relates #77466

@original-brownbear original-brownbear added >enhancement :Distributed Coordination/Network Http and internode communication implementations labels Nov 5, 2021
@elasticmachine elasticmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Nov 5, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Nov 10, 2021
Adds to the transport node stats a record of the distribution of the
times for which a transport thread was handling a message, represented
as a histogram.

Closes elastic#80428
DaveCTurner added a commit that referenced this issue Nov 29, 2021
Adds to the transport node stats a record of the distribution of the
times for which a transport thread was handling a message, represented
as a histogram.

Closes #80428