Event loop metrics #279

Open
symbiont-stevan-andjelkovic opened this issue Jun 18, 2021 · 0 comments

symbiont-stevan-andjelkovic commented Jun 18, 2021

Sketch of how we could collect metrics for measuring performance inside the event loop.

  • Histograms

    • logarithmic bucketing rather than sampling

      "Unlike popular metric systems today, this does not destroy the accuracy of
      histograms by sampling. Instead, a logarithmic bucketing function compresses
      values, generally within 1% of their true value (although between 0 and 1 the
      precision loss may not be within this boundary). This allows for extreme
      compression, which allows us to calculate arbitrarily high percentiles with no
      loss of accuracy - just a small amount of precision. This is particularly
      useful for highly-clustered events that are tolerant of a small precision loss,
      but for which you REALLY care about what the tail looks like, such as measuring
      latency across a distributed system." -- Tyler "spacejam" Neely

    • Simple interface, basically two functions:

      • measure(value: Double, h: Histogram)
      • percentile(p: Double, h: Histogram): Double
      • E.g.:
         h := newHistogram
         measure(1, h)
         measure(1, h)
         measure(2, h)
         measure(2, h)
         measure(3, h)
         assert(percentile(0.0, h), 1) // min
         assert(percentile(40.0, h), 1)
         assert(percentile(40.1, h), 2)
         assert(percentile(80.0, h), 2)
         assert(percentile(80.1, h), 3)
         assert(percentile(100.0, h), 3) // max
        
    • Implementation idea: any double can be compressed into one of 2^16 buckets,
      with less than 1% compression loss, using the natural logarithm function for
      compression and exponentiation for decompression, hence the name logarithmic
      bucketing (a sketch of this follows the links below)

    • Rust crate: https://github.com/spacejam/historian

    • Golang lib: https://github.com/spacejam/loghisto
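
    • A minimal sketch of the logarithmic bucketing idea in Go (Go only because
      loghisto above is a Go library). This is an illustration under assumptions,
      not the scheme historian or loghisto actually use: it keeps a map of bucket
      counts rather than a fixed array of 2^16 buckets, and all names are made up.
      It does reproduce the percentile assertions from the example above.

        package histogram

        import (
            "math"
            "sort"
            "sync"
        )

        // precision controls the relative error between neighbouring buckets:
        // exp(1/precision) - 1 is roughly 1% for precision = 100.
        const precision = 100.0

        // Histogram stores counts per logarithmic bucket; only meaningful for
        // positive values in this sketch.
        type Histogram struct {
            mu      sync.Mutex
            buckets map[int]uint64 // bucket index -> count
            total   uint64
        }

        func NewHistogram() *Histogram {
            return &Histogram{buckets: make(map[int]uint64)}
        }

        // compress maps a positive value to a bucket index using the natural
        // logarithm; decompress maps an index back to a representative value.
        func compress(value float64) int    { return int(math.Round(math.Log(value) * precision)) }
        func decompress(bucket int) float64 { return math.Exp(float64(bucket) / precision) }

        // Measure records one observation.
        func (h *Histogram) Measure(value float64) {
            h.mu.Lock()
            defer h.mu.Unlock()
            h.buckets[compress(value)]++
            h.total++
        }

        // Percentile returns the value below which p percent of observations
        // fall (p in [0, 100]); p = 0 gives the minimum, p = 100 the maximum.
        func (h *Histogram) Percentile(p float64) float64 {
            h.mu.Lock()
            defer h.mu.Unlock()
            if h.total == 0 {
                return math.NaN()
            }
            keys := make([]int, 0, len(h.buckets))
            for k := range h.buckets {
                keys = append(keys, k)
            }
            sort.Ints(keys)
            target := uint64(math.Ceil(p / 100.0 * float64(h.total)))
            if target == 0 {
                target = 1
            }
            var seen uint64
            for _, k := range keys {
                seen += h.buckets[k]
                if seen >= target {
                    return decompress(k)
                }
            }
            return decompress(keys[len(keys)-1])
        }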

  • Metrics

    • Merely a record of histograms and counters

    • Together they capture the following metrics (taken from https://sled.rs/perf.html#metrics):

      • latency - the time that an operation takes

      • throughput - how many operations can be performed in some unit of time

      • utilization - the proportion of time that a system (server, disk, hashmap,
        etc...) is busy handling requests, as opposed to waiting for
        the next request to arrive.

      • saturation - the extent to which requests must queue before being handled
        by the system, usually measured in terms of queue depth
        (length).

      • space - how much memory or storage the system uses.

    • E.g. one histogram for client request/response latency, another for client
      request saturation (recording the queue depth at the moment the client
      request arrived), and a counter for throughput (see the sketch after this
      list)

    • Built into the SUT and deployment agnostic; could be sampled by e.g.
      Prometheus or anything else?
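
    • A sketch of such a metrics record, building on the hypothetical Histogram
      from the sketch above; the type, field and method names are made up for
      illustration, not an existing API.

        // Same hypothetical package as the Histogram sketch above.

        import (
            "sync/atomic"
            "time"
        )

        // Metrics is just a record of histograms and counters.
        type Metrics struct {
            ClientLatency    *Histogram // client request/response latency in microseconds
            ClientSaturation *Histogram // event queue depth seen when a client request arrives
            ClientResponses  uint64     // throughput counter, incremented per client response
        }

        func NewMetrics() *Metrics {
            return &Metrics{
                ClientLatency:    NewHistogram(),
                ClientSaturation: NewHistogram(),
            }
        }

        // OnClientRequest records the queue depth at the moment a client request
        // arrives (shifted by one so that an empty queue is representable in the
        // log-bucketed histogram).
        func (m *Metrics) OnClientRequest(queueDepth int) {
            m.ClientSaturation.Measure(float64(queueDepth + 1))
        }

        // OnClientResponse records the latency of the request that started at
        // `start` and bumps the throughput counter.
        func (m *Metrics) OnClientResponse(start time.Time) {
            m.ClientLatency.Measure(float64(time.Since(start).Microseconds()))
            atomic.AddUint64(&m.ClientResponses, 1)
        }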

  • Event loop metrics

    • USE (utilisation, saturation and errors)

      • We already mentioned client req/resp latency, client req saturation and
        throughput above;

      • Main event loop utilisation: keep track of when we finished processing
        the last event; at the start of processing a new event, measure the time
        elapsed since then and add it to a running sum of idle time (see the
        sketch after this list);

      • We could do the same for the async I/O worker thread pool;

      • The async I/O work queue can also be measured for saturation, in the same
        way as the event queue;

      • Actor utilisation: # of messages sent to actor / # of total messages sent;

      • Actor space: one crude way would be to check the length of the JSON string
        when we serialise the state;

      • Errors?

    • TSA (Thread State Analysis)
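
    • A minimal sketch of the idle-time bookkeeping for main event loop
      utilisation described above; the type and method names are made up.

        import "time"

        // LoopUtilisation accumulates the time the event loop spends waiting
        // between events and reports busy time / wall-clock time.
        type LoopUtilisation struct {
            started      time.Time     // when the event loop started
            lastFinished time.Time     // when the previous event finished processing
            idle         time.Duration // accumulated time spent waiting for events
        }

        func NewLoopUtilisation(now time.Time) *LoopUtilisation {
            return &LoopUtilisation{started: now, lastFinished: now}
        }

        // BeforeEvent is called when the loop picks up a new event: the time
        // since the previous event finished is added to the idle sum.
        func (u *LoopUtilisation) BeforeEvent(now time.Time) {
            u.idle += now.Sub(u.lastFinished)
        }

        // AfterEvent is called when the loop finishes processing an event.
        func (u *LoopUtilisation) AfterEvent(now time.Time) {
            u.lastFinished = now
        }

        // Utilisation is the fraction of wall-clock time spent processing events.
        func (u *LoopUtilisation) Utilisation(now time.Time) float64 {
            total := now.Sub(u.started)
            if total <= 0 {
                return 0
            }
            return float64(total-u.idle) / float64(total)
        }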

  • What to optimise?

  • Questions

    • Regarding Neely's comment on sampling, does that mean important things can
      get lost when sampling?

    • Can we do causal profiling by virtually speeding up actors (i.e. slowing
      down all but one actor)?

    • In Coz (the causal profiler) they measure latency by:

      1. having a transactions counter that gets incremented when a client req
        arrives and decremented when a response is written back to the client
      2. having a counter for throughput which is increased on client resp
      3. using Little's law: latency = transactions / throughput.

      Is this in any way more accurate than simply taking the difference in time
      between the client request and response? Or do they simply do this because
      it's more efficient in terms of CPU? (A small numeric sketch follows this
      list.)

    • Using Little's law seems to make sense for client req/resp, because they
      have a clear notion of transaction, but internal messages between nodes
      don't have that and so we can't measure latency for those using the law?
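
    • To make the Little's law question above concrete, a small numeric sketch
      (not Coz's actual implementation; the function names are made up): with on
      average 10 transactions in flight and 2000 responses per second, the mean
      latency comes out as 10 / 2000 s = 5 ms.

        package main

        import (
            "fmt"
            "time"
        )

        // meanLatency estimates average latency via Little's law:
        //   latency = mean number of in-flight transactions / throughput.
        func meanLatency(meanInFlight, responses float64, window time.Duration) time.Duration {
            throughput := responses / window.Seconds() // responses per second
            return time.Duration(meanInFlight / throughput * float64(time.Second))
        }

        func main() {
            // 10 transactions in flight on average, 2000 responses in one second.
            fmt.Println(meanLatency(10, 2000, time.Second)) // 5ms
        }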

symbiont-stevan-andjelkovic added this to the v0.1.0 milestone Jun 30, 2021