More xsnap instrumentation #10424

Open
mhofman opened this issue Nov 7, 2024 · 0 comments
Labels
enhancement New feature or request telemetry xsnap the XS execution tool


mhofman commented Nov 7, 2024

What is the Problem Being Solved?

We need to figure out where we're spending time in xsnap, and be able to quickly compare timings of the same delivery between validators, or compare equivalent deliveries in the same validator.

Description of the Design

Timing

Add a sum of wall-clock time spent in xsnap while handling a command. Currently, recordTimestamp only pushes the current timestamp into a fixed-size buffer and expects post-processing to be done on the slog where these appear. We want to introduce some processing here to simplify analysis and queries: when receiving a command or a response, xsnap should independently record the current timestamp, and when issuing a command or the final response, compute the time elapsed since that recorded timestamp and add it to a "total time spent in xsnap" counter. Like the individual timestamps buffer today, this total time counter would be reset when a command is received, and included in the status data when responding to the command.
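A minimal sketch of the accumulation scheme above, assuming POSIX `clock_gettime` with `CLOCK_MONOTONIC`; all function and variable names here are illustrative, not actual xsnap internals:

```c
#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <time.h>

/* Hypothetical per-worker state; the real xsnap state layout will differ. */
static struct timespec lastTimestamp; /* set whenever control enters xsnap */
static uint64_t totalTimeNs;          /* "total time spent in xsnap" */

static uint64_t elapsedNs(const struct timespec *from, const struct timespec *to) {
    int64_t ns = (int64_t)(to->tv_sec - from->tv_sec) * 1000000000LL
               + (int64_t)(to->tv_nsec - from->tv_nsec);
    return (uint64_t)ns;
}

/* On receiving a command: reset the counter and record entry time. */
void onCommandReceived(void) {
    totalTimeNs = 0;
    clock_gettime(CLOCK_MONOTONIC, &lastTimestamp);
}

/* On issuing a syscall or the final response: accumulate time since entry. */
void onLeavingXsnap(void) {
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);
    totalTimeNs += elapsedNs(&lastTimestamp, &now);
}

/* On receiving a syscall's response: record the re-entry time. */
void onReenteringXsnap(void) {
    clock_gettime(CLOCK_MONOTONIC, &lastTimestamp);
}
```

At the final response, `totalTimeNs` holds only the time spent inside xsnap, with the time the kernel waited on the command pipe excluded.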

Similarly to the wall-clock time, we also want to measure the accumulated CPU time spent in the xsnap process. While technically we could measure the CPU time between receiving the command and writing the response, we would likely prefer to exclude the time spent reading from and writing to the command pipe for syscalls performed during delivery processing. It might be worth measuring both.

@siarhei-agoric also suggested that we could hack XS to perform gc in a different thread (but interlocked with the main thread) for the explicit purpose of measuring gc work separately from the main execution.

Memory info

We also want to report in the delivery status data some more detailed gc stats:

  • The number of full gcs performed by xsnap (to be determined whether we want to report compact gcs separately). Non-organic full gcs are not worth reporting at this point, as they are deterministically and externally triggered (snapshot and BOYD deliveries).
  • The number of heap slots "allocated" and "collected".
  • The amount of chunk memory "allocated" and "collected".

XS currently tracks the number of gcs only when instrumentation is enabled, which I believe we cannot enable without changing the shape of heap snapshots.
XS currently tracks the current and peak slot and chunk usage, but it does not track accumulated meters of allocation and collection, which would give us a better sense of how much of this work is performed in a delivery.

We should report these up/down gc metrics scoped to the delivery (i.e. relative to their absolute values at the beginning of the delivery). We should continue to report the current heap count at the end of the delivery, even though it could be derived as the previous value, plus up during delivery, minus down during delivery. I believe we currently do not report the chunk size usage. From what I understand, the current allocate value represents the raw memory ever (peak) allocated for both heap and chunk usage.
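The delivery-scoped reporting could look like the following sketch, assuming monotonically increasing counters maintained by the allocator/gc; the struct and field names are illustrative, not actual XS fields:

```c
#include <stdint.h>

/* Hypothetical accumulated gc meters, assumed to only ever increase. */
typedef struct {
    uint64_t fullGcCount;
    uint64_t slotsAllocated, slotsCollected;
    uint64_t chunkBytesAllocated, chunkBytesCollected;
} GcMeters;

/* Scope meters to a delivery: snapshot at delivery start, report the
 * deltas at delivery end alongside the absolute current heap count. */
GcMeters gcMetersDelta(const GcMeters *atStart, const GcMeters *atEnd) {
    GcMeters d = {
        atEnd->fullGcCount - atStart->fullGcCount,
        atEnd->slotsAllocated - atStart->slotsAllocated,
        atEnd->slotsCollected - atStart->slotsCollected,
        atEnd->chunkBytesAllocated - atStart->chunkBytesAllocated,
        atEnd->chunkBytesCollected - atStart->chunkBytesCollected,
    };
    return d;
}
```

With this shape, the end-of-delivery heap count can be cross-checked against the previous count plus `slotsAllocated` minus `slotsCollected`.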

Security Considerations

None. These new counters should, for now, remain outside of consensus and not affect it, even indirectly (e.g. through heap snapshots). I believe that means they cannot be stored in the machine info.

Currently, only the compute data is explicitly recorded in consensus as part of the delivery result (in the transcript store).

Scaling Considerations

These are just single counters; they should have little impact on the performance of the xsnap worker, and should only have O(1) memory usage.

We may want to reset allocation counters at the beginning of each command / delivery to limit the risk of overflow, since we never report accumulated results across them.

Test Plan

It may be sufficient to observe these new fields in the results of existing xsnap tests. In a perfect world we'd have unit tests demonstrating each scenario.

Upgrade Considerations

This affects chain software, but should be implemented in a way that doesn't affect any consensus state, so that it can be applied as a cherry-pick, including on previous software releases such as upgrades 16 & 17, to gather information for diagnosing documented performance issues.

Relevant other issues
