Sketch of how we could collect metrics for measuring performance inside the event loop.
Histograms
logarithmic bucketing rather than sampling
"Unlike popular metric systems today, this does not destroy the accuracy of
histograms by sampling. Instead, a logarithmic bucketing function compresses
values, generally within 1% of their true value (although between 0 and 1 the
precision loss may not be within this boundary). This allows for extreme
compression, which allows us to calculate arbitrarily high percentiles with no
loss of accuracy - just a small amount of precision. This is particularly
useful for highly-clustered events that are tolerant of a small precision loss,
but for which you REALLY care about what the tail looks like, such as measuring
latency across a distributed system." -- Tyler "spacejam" Neely
Simple interface, basically two functions:
measure(value: Double, h: Histogram)
percentile(p: Double, h: Histogram): Double
Implementation idea: any double can be compressed into one of 2^16 buckets,
with less than 1% compression loss (using the natural logarithm function for
compression and exponentiation for decompression, hence the name logarithmic
bucketing); a sketch of this follows after the library links below
Rust crate: https://github.com/spacejam/historian
Golang lib: https://github.com/spacejam/loghisto
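A minimal sketch of the measure/percentile interface with logarithmic bucketing, in Go. The precision constant, the map-backed buckets and the function names are assumptions of this sketch, not the APIs of the crates linked above:

```go
package metrics

import (
	"math"
	"sort"
)

// precision controls the bucket width: neighbouring buckets differ by a
// factor of exp(1/precision) ~= 1.005, i.e. roughly 0.5% relative error.
const precision = 100.0

// Histogram counts observations in logarithmically sized buckets, so it can
// answer arbitrarily high percentile queries with a small relative precision
// loss instead of sampling.
type Histogram struct {
	counts map[int16]uint64 // bucket index -> number of observations
	total  uint64
}

func NewHistogram() *Histogram {
	return &Histogram{counts: make(map[int16]uint64)}
}

// compress maps a non-negative value to one of at most 2^16 buckets.
func compress(value float64) int16 {
	// log1p keeps values in [0, 1) well-behaved; indices outside the int16
	// range are clamped to the extreme buckets.
	i := math.Round(math.Log1p(value) * precision)
	if i > math.MaxInt16 {
		return math.MaxInt16
	}
	if i < math.MinInt16 {
		return math.MinInt16
	}
	return int16(i)
}

// decompress returns a representative value for a bucket index.
func decompress(index int16) float64 {
	return math.Expm1(float64(index) / precision)
}

// Measure records one observation.
func (h *Histogram) Measure(value float64) {
	h.counts[compress(value)]++
	h.total++
}

// Percentile returns an approximation of the value below which p percent of
// the observations fall, e.g. h.Percentile(99.9) for tail latency.
func (h *Histogram) Percentile(p float64) float64 {
	if h.total == 0 {
		return 0
	}
	indices := make([]int16, 0, len(h.counts))
	for i := range h.counts {
		indices = append(indices, i)
	}
	sort.Slice(indices, func(a, b int) bool { return indices[a] < indices[b] })
	threshold := uint64(math.Ceil(p / 100 * float64(h.total)))
	var seen uint64
	for _, i := range indices {
		seen += h.counts[i]
		if seen >= threshold {
			return decompress(i)
		}
	}
	return decompress(indices[len(indices)-1])
}
```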
Metrics
Merely a record of histograms and counters
Together they capture the following metrics (taken from https://sled.rs/perf.html#metrics):
latency - the time that an operation takes
throughput - how many operations can be performed in some unit of time
utilization - the proportion of time that a system (server, disk, hashmap,
etc...) is busy handling requests, as opposed to waiting for
the next request to arrive.
saturation - the extent to which requests must queue before being handled
by the system, usually measured in terms of queue depth
(length).
space - whoah.
E.g. one histogram for client request/response latency, another one for
client req saturation (keeping track of what the queue depth was when the
client req arrived), and a counter for throughput (sketched below)
Built into the SUT, deployment-agnostic; could be sampled by e.g. Prometheus
or anything else?
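A sketch of what such a metrics record could look like, reusing the Histogram from the sketch above. The field names and helper methods are illustrative assumptions, not a fixed design:

```go
package metrics

import "time"

// Metrics is merely a record of histograms and counters; the concrete fields
// below are illustrative.
type Metrics struct {
	ClientLatency    *Histogram // client request/response latency (seconds)
	ClientSaturation *Histogram // event queue depth when a client request arrived
	ClientResponses  uint64     // throughput counter: responses / elapsed time
}

func NewMetrics() *Metrics {
	return &Metrics{
		ClientLatency:    NewHistogram(),
		ClientSaturation: NewHistogram(),
	}
}

// RecordRequest would be called when a client request arrives, with the
// current depth of the event queue.
func (m *Metrics) RecordRequest(queueDepth int) {
	m.ClientSaturation.Measure(float64(queueDepth))
}

// RecordResponse would be called when a response is written back to the client.
func (m *Metrics) RecordResponse(requestArrived time.Time) {
	m.ClientLatency.Measure(time.Since(requestArrived).Seconds())
	m.ClientResponses++
}
```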
Event loop metrics
USE (utilisation, saturation and errors)
We already mentioned client req/resp latency, client req saturation and
throughput above;
Main event loop utilisation: record the time at which we finished processing
the last event; at the beginning of processing a new event, measure the time
elapsed since then and add the difference to a running sum of idle time (see
the sketch after this list);
We could do something similar for the async I/O worker thread pool;
The async I/O work queue can also be measured for saturation, in the same way
as the event queue;
Actor utilisation: # of messages sent to actor / # of total messages sent;
Actor space: one crude way would be to check the length of the JSON string
when we serialise the state;
Errors?
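A sketch of the idle-time bookkeeping and queue-depth measurement for the main event loop, reusing the Histogram from above. The Event type, handle function and channel-based queue are placeholders for whatever the real event loop looks like:

```go
package metrics

import "time"

type Event struct{} // placeholder for whatever the event loop processes

func handle(Event) {} // placeholder for the real event handler

// LoopMetrics tracks utilisation (busy vs idle time) and saturation (queue
// depth at the time an event is picked up) for the main event loop.
type LoopMetrics struct {
	idle       time.Duration
	busy       time.Duration
	saturation *Histogram
}

func eventLoop(queue chan Event, m *LoopMetrics) {
	lastFinished := time.Now()
	for ev := range queue {
		start := time.Now()
		m.idle += start.Sub(lastFinished)         // time spent waiting since the last event
		m.saturation.Measure(float64(len(queue))) // queue depth when the event was dequeued
		handle(ev)
		lastFinished = time.Now()
		m.busy += lastFinished.Sub(start)
	}
}

// Utilisation is the proportion of time the loop was busy handling events,
// as opposed to waiting for the next event to arrive.
func (m *LoopMetrics) Utilisation() float64 {
	total := m.busy + m.idle
	if total == 0 {
		return 0
	}
	return float64(m.busy) / float64(total)
}
```

The same bookkeeping could presumably be reused for the async I/O worker pool and its work queue.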
TSA (Thread State Analysis)
What to optimise?
Metrics help us figure out how well the SUT performs, but they don't tell us
anything about where to optimise if we want it to perform better
From https://sled.rs/perf.html#scouting-ahead
https://github.com/flamegraph-rs/flamegraph#systems-performance-work-guided-by-flamegraphs
Questions
Regarding Neely's comment on sampling, does that mean important things can
get lost when sampling?
Can we do causal profiling by virtually speeding up actors (i.e. slowing
down all but one actor)?
In Coz (the causal profiler) they measure latency by:
having a transactions counter that gets incremented when a client req
arrives and decremented when a response is written back to the client
having a counter for throughput which is increased on client resp
using Little's law: latency = transactions / throughput (see the sketch at
the end of this section).
Is this any more accurate than simply taking the difference in time
between client req and resp? Or do they simply do this because it's more
efficient in terms of CPU?
Using Little's law seems to make sense for client req/resp, because they
have a clear notion of transaction, but internal messages between nodes
don't have that and so we can't measure latency for those using the law?
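For reference, a sketch of the Coz-style estimate under the description above; the counter names and the fixed measurement window are assumptions, and whether this beats per-request timestamps in accuracy is exactly the open question:

```go
package metrics

import "time"

// LittleCounters implements the latency estimate described above:
// latency = transactions / throughput (Little's law).
type LittleCounters struct {
	transactions int64  // in-flight client requests
	responses    uint64 // responses written back during the window
}

func (c *LittleCounters) OnRequest()  { c.transactions++ }
func (c *LittleCounters) OnResponse() { c.transactions--; c.responses++ }

// Latency estimates the average latency over a measurement window of length
// elapsed, using the current number of in-flight transactions and the
// throughput observed during the window.
func (c *LittleCounters) Latency(elapsed time.Duration) time.Duration {
	if c.responses == 0 {
		return 0
	}
	throughput := float64(c.responses) / elapsed.Seconds() // responses per second
	return time.Duration(float64(c.transactions) / throughput * float64(time.Second))
}
```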