More xsnap instrumentation #10424

Open
mhofman opened this issue Nov 7, 2024 · 0 comments
Labels
enhancement New feature or request telemetry xsnap the XS execution tool


mhofman commented Nov 7, 2024

What is the Problem Being Solved?

We need to figure out where we're spending time in xsnap, and be able to quickly compare timings of the same delivery between validators, or compare equivalent deliveries in the same validator.

Description of the Design

Timing

Add a sum of wall-clock time spent in xsnap while handling a command. Currently, recordTimestamp only pushes the current timestamp into a fixed-size buffer and expects post-processing to be done on the slog where these appear. We want to introduce some processing here to simplify analysis and queries: when receiving a command or a response, xsnap should independently record the current timestamp, and when issuing a command or the final response, compute the time elapsed since that recorded timestamp and add it to a "total time spent in xsnap" counter. Like the individual timestamps buffer today, this total time counter would be reset when a command is received, and included in the status data when responding to the command.
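A minimal sketch of the accumulation scheme above, assuming POSIX `clock_gettime` with `CLOCK_MONOTONIC`; all function and variable names here are illustrative, not actual xsnap internals:

```c
#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <time.h>

/* Hypothetical per-worker state; the real xsnap state layout will differ. */
static struct timespec lastTimestamp; /* set whenever control enters xsnap */
static uint64_t totalTimeNs;          /* "total time spent in xsnap" */

static uint64_t elapsedNs(const struct timespec *from, const struct timespec *to) {
    int64_t ns = (int64_t)(to->tv_sec - from->tv_sec) * 1000000000LL
               + (int64_t)(to->tv_nsec - from->tv_nsec);
    return (uint64_t)ns;
}

/* On receiving a command: reset the counter and record entry time. */
void onCommandReceived(void) {
    totalTimeNs = 0;
    clock_gettime(CLOCK_MONOTONIC, &lastTimestamp);
}

/* On issuing a syscall or the final response: accumulate time since entry. */
void onLeavingXsnap(void) {
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);
    totalTimeNs += elapsedNs(&lastTimestamp, &now);
}

/* On receiving a syscall's response: record the re-entry time. */
void onReenteringXsnap(void) {
    clock_gettime(CLOCK_MONOTONIC, &lastTimestamp);
}
```

At the final response, `totalTimeNs` holds only the time spent inside xsnap, with the time the kernel waited on the command pipe excluded.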

Similarly to the wall-clock time, we also want to measure the accumulated CPU time spent in the xsnap process. While technically we could measure the CPU time between receiving the command and writing the response, we would likely prefer to exclude the time spent reading from and writing to the command pipe for syscalls performed during delivery processing. It might be worth measuring both.

@siarhei-agoric also suggested that we could hack XS to perform gc in a different thread (but interlocked with the main thread) for the explicit purpose of measuring gc work separately from the main execution.

Memory info

We also want to report in the delivery status data some more detailed gc stats:

  • The number of full gcs performed by xsnap (to be determined whether we want to report compact gcs separately). Non-organic full gcs are not worth reporting at this point, as they are deterministically and externally triggered (snapshot and BOYD deliveries).
  • The number of heap slots "allocated" and "collected".
  • The amount of chunk memory "allocated" and "collected".

XS currently tracks the number of gcs only when instrumentation is enabled, which I believe we cannot enable without changing the shape of heap snapshots.
XS currently tracks the current and peak slot and chunk usage, but it does not track accumulated meters of allocation and collection, which would give us a better sense of how much of this work is performed in a delivery.

We should report these up/down gc metrics scoped to the delivery (i.e. relative to their absolute values at the beginning of the delivery). We should continue to report the current heap count at the end of the delivery, even though it could be derived as the previous value, plus up during delivery, minus down during delivery. I believe we currently do not report the chunk size usage. From what I understand, the current allocate value represents the raw memory ever (peak) allocated for both heap and chunk usage.
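The delivery-scoped reporting could look like the following sketch, assuming monotonically increasing counters maintained by the allocator/gc; the struct and field names are illustrative, not actual XS fields:

```c
#include <stdint.h>

/* Hypothetical accumulated gc meters, assumed to only ever increase. */
typedef struct {
    uint64_t fullGcCount;
    uint64_t slotsAllocated, slotsCollected;
    uint64_t chunkBytesAllocated, chunkBytesCollected;
} GcMeters;

/* Scope meters to a delivery: snapshot at delivery start, report the
 * deltas at delivery end alongside the absolute current heap count. */
GcMeters gcMetersDelta(const GcMeters *atStart, const GcMeters *atEnd) {
    GcMeters d = {
        atEnd->fullGcCount - atStart->fullGcCount,
        atEnd->slotsAllocated - atStart->slotsAllocated,
        atEnd->slotsCollected - atStart->slotsCollected,
        atEnd->chunkBytesAllocated - atStart->chunkBytesAllocated,
        atEnd->chunkBytesCollected - atStart->chunkBytesCollected,
    };
    return d;
}
```

With this shape, the end-of-delivery heap count can be cross-checked against the previous count plus `slotsAllocated` minus `slotsCollected`.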

Security Considerations

None. These new counters should, for now, remain outside of consensus and not affect it, even indirectly (e.g. through heap snapshots). I believe that means they cannot be stored in the machine info.

Currently, only the compute data is explicitly recorded in consensus as part of the delivery result (in the transcript store).

Scaling Considerations

These are just single counters; they should have little impact on the performance of the xsnap worker, and should only have O(1) memory usage.

We may want to reset allocation counters at the beginning of each command / delivery to limit the risk of overflow, since we never report accumulated results across them.

Test Plan

It may be sufficient to observe these new fields in the results of existing xsnap tests. In a perfect world we'd have unit tests demonstrating each scenario.

Upgrade Considerations

This affects chain software, but should be implemented in a way that doesn't affect any consensus state, so that it can be applied as a cherry-pick, including on previous software releases such as upgrades 16 & 17, to gather information for diagnosing documented performance issues.

Relevant other issues
