Sketch of how we could collect metrics for measuring performance inside the event loop.
Histograms
logarithmic bucketing rather than sampling
"Unlike popular metric systems today, this does not destroy the accuracy of
histograms by sampling. Instead, a logarithmic bucketing function compresses
values, generally within 1% of their true value (although between 0 and 1 the
precision loss may not be within this boundary). This allows for extreme
compression, which allows us to calculate arbitrarily high percentiles with no
loss of accuracy - just a small amount of precision. This is particularly
useful for highly-clustered events that are tolerant of a small precision loss,
but for which you REALLY care about what the tail looks like, such as measuring
latency across a distributed system." -- Tyler "spacejam" Neely
Simple interface, basically two functions:
measure(value: Double, h: Histogram)
percentile(p: Double, h: Histogram): Double
Implementation idea: any double can be compressed into one of 2^16 buckets,
with less than 1% compression loss (using the natural logarithm function for
compression and exponentiation for decompression, hence the name logarithmic
bucketing); a sketch of this follows after the library links below
Rust crate: https://github.com/spacejam/historian
Golang lib: https://github.com/spacejam/loghisto
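A minimal sketch of the measure/percentile interface with logarithmic bucketing, in Go. The precision constant, the map-backed buckets and the function names are assumptions of this sketch, not the APIs of the crates linked above:

```go
package metrics

import (
	"math"
	"sort"
)

// precision controls the bucket width: neighbouring buckets differ by a
// factor of exp(1/precision) ~= 1.005, i.e. roughly 0.5% relative error.
const precision = 100.0

// Histogram counts observations in logarithmically sized buckets, so it can
// answer arbitrarily high percentile queries with a small relative precision
// loss instead of sampling.
type Histogram struct {
	counts map[int16]uint64 // bucket index -> number of observations
	total  uint64
}

func NewHistogram() *Histogram {
	return &Histogram{counts: make(map[int16]uint64)}
}

// compress maps a non-negative value to one of at most 2^16 buckets.
func compress(value float64) int16 {
	// log1p keeps values in [0, 1) well-behaved; indices outside the int16
	// range are clamped to the extreme buckets.
	i := math.Round(math.Log1p(value) * precision)
	if i > math.MaxInt16 {
		return math.MaxInt16
	}
	if i < math.MinInt16 {
		return math.MinInt16
	}
	return int16(i)
}

// decompress returns a representative value for a bucket index.
func decompress(index int16) float64 {
	return math.Expm1(float64(index) / precision)
}

// Measure records one observation.
func (h *Histogram) Measure(value float64) {
	h.counts[compress(value)]++
	h.total++
}

// Percentile returns an approximation of the value below which p percent of
// the observations fall, e.g. h.Percentile(99.9) for tail latency.
func (h *Histogram) Percentile(p float64) float64 {
	if h.total == 0 {
		return 0
	}
	indices := make([]int16, 0, len(h.counts))
	for i := range h.counts {
		indices = append(indices, i)
	}
	sort.Slice(indices, func(a, b int) bool { return indices[a] < indices[b] })
	threshold := uint64(math.Ceil(p / 100 * float64(h.total)))
	var seen uint64
	for _, i := range indices {
		seen += h.counts[i]
		if seen >= threshold {
			return decompress(i)
		}
	}
	return decompress(indices[len(indices)-1])
}
```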
Metrics
Merely a record of histograms and counters
Together they capture the following metrics (taken from https://sled.rs/perf.html#metrics):
latency - the time that an operation takes
throughput - how many operations can be performed in some unit of time
utilization - the proportion of time that a system (server, disk, hashmap,
etc...) is busy handling requests, as opposed to waiting for
the next request to arrive.
saturation - the extent to which requests must queue before being handled
by the system, usually measured in terms of queue depth
(length).
space - whoah.
E.g. one histogram for client request/response latency, another one for
client req saturation (keeping track of what the queue depth was when the
client req arrived), and a counter for throughput (sketched below)
Built into the SUT, deployment-agnostic; could be sampled by e.g. Prometheus
or anything else?
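A sketch of what such a metrics record could look like, reusing the Histogram from the sketch above. The field names and helper methods are illustrative assumptions, not a fixed design:

```go
package metrics

import "time"

// Metrics is merely a record of histograms and counters; the concrete fields
// below are illustrative.
type Metrics struct {
	ClientLatency    *Histogram // client request/response latency (seconds)
	ClientSaturation *Histogram // event queue depth when a client request arrived
	ClientResponses  uint64     // throughput counter: responses / elapsed time
}

func NewMetrics() *Metrics {
	return &Metrics{
		ClientLatency:    NewHistogram(),
		ClientSaturation: NewHistogram(),
	}
}

// RecordRequest would be called when a client request arrives, with the
// current depth of the event queue.
func (m *Metrics) RecordRequest(queueDepth int) {
	m.ClientSaturation.Measure(float64(queueDepth))
}

// RecordResponse would be called when a response is written back to the client.
func (m *Metrics) RecordResponse(requestArrived time.Time) {
	m.ClientLatency.Measure(time.Since(requestArrived).Seconds())
	m.ClientResponses++
}
```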
Event loop metrics
USE (utilisation, saturation and errors)
We already mentioned client req/resp latency, client req saturation and
throughput above;
Main event loop utilisation: record the time at which we finished processing
the last event; at the beginning of processing a new event, measure the time
elapsed since then and add the difference to a running sum of idle time (see
the sketch after this list);
We could do something similar for the async I/O worker thread pool;
The async I/O work queue can also be measured for saturation, in the same way
as the event queue;
Actor utilisation: # of messages sent to actor / # of total messages sent;
Actor space: one crude way would be to check the length of the JSON string
when we serialise the state;
Errors?
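A sketch of the idle-time bookkeeping and queue-depth measurement for the main event loop, reusing the Histogram from above. The Event type, handle function and channel-based queue are placeholders for whatever the real event loop looks like:

```go
package metrics

import "time"

type Event struct{} // placeholder for whatever the event loop processes

func handle(Event) {} // placeholder for the real event handler

// LoopMetrics tracks utilisation (busy vs idle time) and saturation (queue
// depth at the time an event is picked up) for the main event loop.
type LoopMetrics struct {
	idle       time.Duration
	busy       time.Duration
	saturation *Histogram
}

func eventLoop(queue chan Event, m *LoopMetrics) {
	lastFinished := time.Now()
	for ev := range queue {
		start := time.Now()
		m.idle += start.Sub(lastFinished)         // time spent waiting since the last event
		m.saturation.Measure(float64(len(queue))) // queue depth when the event was dequeued
		handle(ev)
		lastFinished = time.Now()
		m.busy += lastFinished.Sub(start)
	}
}

// Utilisation is the proportion of time the loop was busy handling events,
// as opposed to waiting for the next event to arrive.
func (m *LoopMetrics) Utilisation() float64 {
	total := m.busy + m.idle
	if total == 0 {
		return 0
	}
	return float64(m.busy) / float64(total)
}
```

The same bookkeeping could presumably be reused for the async I/O worker pool and its work queue.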
TSA (Thread State Analysis)
What to optimise?
Metrics help us figure out how well the SUT performs, but they don't tell us
anything about where to optimise if we want it to perform better
From https://sled.rs/perf.html#scouting-ahead
https://github.com/flamegraph-rs/flamegraph#systems-performance-work-guided-by-flamegraphs
Questions
Regarding Neely's comment on sampling, does that mean important things can
get lost when sampling?
Can we do causal profiling by virtually speeding up actors (i.e. slowing
down all but one actor)?
In Coz (the causal profiler) they measure latency by:
having a transactions counter that gets incremented when a client req
arrives and decremented when a response is written back to the client
having a counter for throughput which is increased on client resp
using Little's law: latency = transactions / throughput (see the sketch at
the end of this section).
Is this any more accurate than simply taking the difference in time
between client req and resp? Or do they simply do this because it's more
efficient in terms of CPU?
Using Little's law seems to make sense for client req/resp, because they
have a clear notion of transaction, but internal messages between nodes
don't have that and so we can't measure latency for those using the law?
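For reference, a sketch of the Coz-style estimate under the description above; the counter names and the fixed measurement window are assumptions, and whether this beats per-request timestamps in accuracy is exactly the open question:

```go
package metrics

import "time"

// LittleCounters implements the latency estimate described above:
// latency = transactions / throughput (Little's law).
type LittleCounters struct {
	transactions int64  // in-flight client requests
	responses    uint64 // responses written back during the window
}

func (c *LittleCounters) OnRequest()  { c.transactions++ }
func (c *LittleCounters) OnResponse() { c.transactions--; c.responses++ }

// Latency estimates the average latency over a measurement window of length
// elapsed, using the current number of in-flight transactions and the
// throughput observed during the window.
func (c *LittleCounters) Latency(elapsed time.Duration) time.Duration {
	if c.responses == 0 {
		return 0
	}
	throughput := float64(c.responses) / elapsed.Seconds() // responses per second
	return time.Duration(float64(c.transactions) / throughput * float64(time.Second))
}
```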