What is the Problem Being Solved?
We need to figure out where we're spending time in xsnap, and be able to quickly compare timings of the same delivery between validators, or compare equivalent deliveries within the same validator.
Description of the Design
Timing
Add a sum of wall clock time spent in xsnap while handling a command. Currently recordTimestamp only pushes the current timestamp into a fixed-size buffer, and expects post-processing to be done on the slog where these timestamps appear. We want to introduce some processing here to simplify analysis and queries. When receiving a command, or a response to a command it issued, xsnap should independently record the current timestamp; when issuing a command or the final response, it should compute the time elapsed since that recorded timestamp and add it to a "total time spent in xsnap" counter. As with the individual timestamp buffer today, this total time counter would be reset when a command is received, and included in the status data when responding to the command.
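A minimal sketch of what that accounting could look like in the worker, assuming a monotonic-clock helper; the names (totalXsnapNanos, onReceive, onSend) are illustrative only, not existing xsnap code:

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical accumulator; names are illustrative, not the real xsnap fields. */
static uint64_t totalXsnapNanos;   /* "total time spent in xsnap" for the current command */
static uint64_t lastEntryNanos;    /* timestamp recorded when xsnap (re)gained control */

static uint64_t nowNanos(void)
{
	struct timespec ts;
	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}

/* Called when a command (or a response to a command xsnap issued) arrives on the pipe. */
static void onReceive(int isNewCommand)
{
	if (isNewCommand)
		totalXsnapNanos = 0;   /* reset per command, like the timestamp buffer today */
	lastEntryNanos = nowNanos();
}

/* Called just before issuing a command or the final response. */
static void onSend(void)
{
	totalXsnapNanos += nowNanos() - lastEntryNanos;
	/* when sending the final response, totalXsnapNanos goes into the status data */
}
```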
Similar to the wall clock time, we also want to measure the accumulated CPU time spent in the xsnap process. While technically we could measure the CPU time between receiving the command and writing the response, we would likely prefer to exclude the time spent reading from / writing to the command pipe for syscalls performed during delivery processing. It might be worth doing both.
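A sketch of measuring both variants, sampled at the same entry/exit points as the wall clock accounting above; again the names are hypothetical, and this assumes CLOCK_PROCESS_CPUTIME_ID is acceptable for the process-wide CPU clock:

```c
/* Process-wide CPU time; all names below are illustrative. */
static uint64_t cpuNanos(void)
{
	struct timespec ts;
	clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
	return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}

static uint64_t cpuAtCommandStart;  /* variant A: command receipt to final response */
static uint64_t cpuExcludingPipe;   /* variant B: excludes pipe reads/writes for syscalls */
static uint64_t cpuAtEntry;

static void onReceiveCpu(int isNewCommand)
{
	uint64_t now = cpuNanos();
	if (isNewCommand) {
		cpuAtCommandStart = now;
		cpuExcludingPipe = 0;
	}
	cpuAtEntry = now;
}

static void onSendCpu(int isFinalResponse, uint64_t *statusTotalCpu, uint64_t *statusWorkCpu)
{
	uint64_t now = cpuNanos();
	cpuExcludingPipe += now - cpuAtEntry;
	if (isFinalResponse) {
		*statusTotalCpu = now - cpuAtCommandStart;  /* variant A */
		*statusWorkCpu  = cpuExcludingPipe;         /* variant B */
	}
}
```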
@siarhei-agoric also suggested that we could hack XS to perform gc in a different thread (but interlocked with the main thread) for the explicit purpose of measuring gc work separately from the main execution.
Memory info
We also want to report in the delivery status data some more detailed gc stats:
The number of full gcs performed by xsnap (to be determined: whether compact gc should be reported separately). Non-organic full gcs are not worth reporting at this point, as they are deterministically and externally triggered (snapshot and BOYD deliveries).
The amount of chunk memory "allocated" and "collected".
XS currently tracks the number of gcs only when instrumentation is enabled, which I believe we cannot turn on without changing the shape of heap snapshots.
XS currently tracks the current and peak slot and chunk usage, but it does not track an accumulated meter of allocations and collections of these, which would give us a better sense of how much of that work is performed in a delivery.
We should report these up/down gc metrics scoped to the delivery (i.e., relative to their absolute values at the beginning of the delivery). We should continue to report the current heap count value at the end of the delivery, even though it could be derived from the previous value plus the "up" during the delivery minus the "down" during the delivery. I believe we currently do not report the chunk size usage. From what I understand, the current allocate value represents the raw memory ever (peak) allocated for both heap and chunk usage.
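A sketch of the delivery-scoped reporting, assuming XS were extended with monotonic allocation/collection counters; the GcTotals struct and field names are hypothetical, not existing XS or xsnap state:

```c
#include <stdint.h>

/* Hypothetical monotonic gc counters, assuming XS were extended to maintain them. */
typedef struct {
	uint64_t fullGcCount;
	uint64_t slotsAllocated;        /* "up" */
	uint64_t slotsCollected;        /* "down" */
	uint64_t chunkBytesAllocated;
	uint64_t chunkBytesCollected;
} GcTotals;

static GcTotals gcAtDeliveryStart;

/* Snapshot the absolute counters when the delivery begins. */
static void beginDelivery(const GcTotals *current)
{
	gcAtDeliveryStart = *current;
}

/* At the end of the delivery, report deltas scoped to the delivery, plus the
 * current heap count (redundant with previous + up - down, but convenient). */
static void reportDeliveryGcStats(const GcTotals *current, uint64_t currentHeapCount,
                                  GcTotals *statusDeltas, uint64_t *statusHeapCount)
{
	statusDeltas->fullGcCount = current->fullGcCount - gcAtDeliveryStart.fullGcCount;
	statusDeltas->slotsAllocated = current->slotsAllocated - gcAtDeliveryStart.slotsAllocated;
	statusDeltas->slotsCollected = current->slotsCollected - gcAtDeliveryStart.slotsCollected;
	statusDeltas->chunkBytesAllocated = current->chunkBytesAllocated - gcAtDeliveryStart.chunkBytesAllocated;
	statusDeltas->chunkBytesCollected = current->chunkBytesCollected - gcAtDeliveryStart.chunkBytesCollected;
	*statusHeapCount = currentHeapCount;
}
```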
Security Considerations
None. These new counters should, for now, remain outside of consensus and not affect it, even indirectly (via heap snapshots). I believe that means they cannot be stored in the machine info.
Currently only the compute data is explicitly recorded in consensus data as part of the delivery result (in the transcript store).
Scaling Considerations
These are just single counters; they should have little impact on the performance of the xsnap worker and should only use O(1) memory.
We may want to reset the allocation counters at the beginning of each command / delivery to limit the risk of overflow, since we never report accumulated results across them.
Test Plan
It may be sufficient to observe these new fields in the results of existing xsnap tests. In a perfect world we'd have unit tests demonstrating each scenario.
Upgrade Considerations
This affects chain software but should be implemented in a way that doesn't affect any consensus state, so that it can be applied as a cherry-pick, including on previous software releases such as upgrades 16 and 17, to gather information for diagnosing documented performance issues.
Relevant other issues
Better represent heap cost in run policy #7373. In particular, the gc information generated here for diagnostics may later become part of the consensus information recorded in the transcript and used to inform the run policy.