
Slowdown of XS worker #6661

Open
mhofman opened this issue Dec 11, 2022 · 5 comments
Labels
bug Something isn't working vaults_triage DO NOT USE xsnap the XS execution tool

Comments

@mhofman
Member

mhofman commented Dec 11, 2022

Describe the bug

An xsnap worker loaded from a recent snapshot is faster than one that has executed the same deliveries without a restart. The long-running worker's memory usage is also higher.

To Reproduce

Use the transcript-replay.js tool, configured to keep multiple workers (a feature currently only available on branch mhofman/6588-diagnose).

Expected behavior

This kind of slowdown / memory growth should not happen: resource usage should depend entirely on the program being executed, not on how long the worker has been running.

Platform Environment

  • what OS are you using? what version of Node.js? Docker on Linux WSL, Node 16 (irrelevant)
  • is there anything special/unusual about your platform? Replay tool
  • what version of the Agoric-SDK are you using? Branch mhofman/6588-diagnose

Additional context

Issue first discovered by @warner while investigating #6625. Confirmed by adding stats to replay tool while investigating #6588.

Screenshots

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 6837 node      20   0 1228260 214984   1588 R  86.7   2.6 193:13.33 xsnap-worker
 6854 node      20   0 1228100 216208   1500 R  86.3   2.7 191:33.62 xsnap-worker
 7621 node      20   0 1227616 216144   1480 R  80.7   2.7 179:34.44 xsnap-worker
 8928 node      20   0 1227204 215556   1512 R  75.0   2.6 163:27.66 xsnap-worker
11735 node      20   0 1226756 215216   1520 R  68.3   2.6 149:52.72 xsnap-worker
15897 node      20   0 1226272 214432   1480 R  63.7   2.6 134:11.45 xsnap-worker
21558 node      20   0 1225828 214124   1472 R  56.3   2.6 116:36.82 xsnap-worker
29043 node      20   0 1225352 213052   1504 R  52.7   2.6  95:41.86 xsnap-worker
21009 node      20   0 1224552 212804   1528 R  50.0   2.6  59:07.48 xsnap-worker
 4433 node      20   0 1194156 183496   2856 R  45.0   2.3  32:56.29 xsnap-worker
31375 node      20   0 1223960 211848   1480 R  45.0   2.6  41:37.15 xsnap-worker
11276 node      20   0 1194052 183512   2856 R  43.0   2.3  23:32.73 xsnap-worker
 7021 node      20   0 1194104 183616   2984 R  41.7   2.3  29:01.86 xsnap-worker
18330 node      20   0 1193984 183556   2884 R  40.7   2.3  15:22.85 xsnap-worker
28662 node      20   0 1193904 183560   2892 R  39.3   2.3   1:42.29 xsnap-worker
29895 node      20   0 1193800 183908   3424 R  39.0   2.3   0:43.17 xsnap-worker
made snapshot 6bbb1e7b5c589db9aa72a0ffc6b4bc96e1e34f73b2f4f03bf3b10f6ed32e5bc9 after delivery 89002 to worker PID 6837.
    Save time = 3.647s. Delivery time since last snapshot 133.835s. Up 89002 deliveries.
made snapshot 6bbb1e7b5c589db9aa72a0ffc6b4bc96e1e34f73b2f4f03bf3b10f6ed32e5bc9 after delivery 89002 to worker PID 6854.
    Save time = 3.804s. Delivery time since last snapshot 133.685s. Up 89000 deliveries.
made snapshot 6bbb1e7b5c589db9aa72a0ffc6b4bc96e1e34f73b2f4f03bf3b10f6ed32e5bc9 after delivery 89002 to worker PID 7621.
    Save time = 3.605s. Delivery time since last snapshot 128.292s. Up 79000 deliveries.
made snapshot 6bbb1e7b5c589db9aa72a0ffc6b4bc96e1e34f73b2f4f03bf3b10f6ed32e5bc9 after delivery 89002 to worker PID 8928.
    Save time = 3.781s. Delivery time since last snapshot 121.568s. Up 69000 deliveries.
made snapshot 6bbb1e7b5c589db9aa72a0ffc6b4bc96e1e34f73b2f4f03bf3b10f6ed32e5bc9 after delivery 89002 to worker PID 11735.
    Save time = 3.503s. Delivery time since last snapshot 116.482s. Up 59000 deliveries.
made snapshot 6bbb1e7b5c589db9aa72a0ffc6b4bc96e1e34f73b2f4f03bf3b10f6ed32e5bc9 after delivery 89002 to worker PID 15897.
    Save time = 3.358s. Delivery time since last snapshot 111.398s. Up 49000 deliveries.
made snapshot 6bbb1e7b5c589db9aa72a0ffc6b4bc96e1e34f73b2f4f03bf3b10f6ed32e5bc9 after delivery 89002 to worker PID 21558.
    Save time = 3.537s. Delivery time since last snapshot 104.596s. Up 39000 deliveries.
made snapshot 6bbb1e7b5c589db9aa72a0ffc6b4bc96e1e34f73b2f4f03bf3b10f6ed32e5bc9 after delivery 89002 to worker PID 29043.
    Save time = 3.345s. Delivery time since last snapshot 99.134s. Up 29000 deliveries.
made snapshot 6bbb1e7b5c589db9aa72a0ffc6b4bc96e1e34f73b2f4f03bf3b10f6ed32e5bc9 after delivery 89002 to worker PID 21009.
    Save time = 3.3s. Delivery time since last snapshot 95.032s. Up 19000 deliveries.
made snapshot 6bbb1e7b5c589db9aa72a0ffc6b4bc96e1e34f73b2f4f03bf3b10f6ed32e5bc9 after delivery 89002 to worker PID 31375.
    Save time = 3.269s. Delivery time since last snapshot 87.107s. Up 9000 deliveries.
made snapshot 6bbb1e7b5c589db9aa72a0ffc6b4bc96e1e34f73b2f4f03bf3b10f6ed32e5bc9 after delivery 89002 to worker PID 4433.
    Save time = 2.767s. Delivery time since last snapshot 85.493s. Up 6000 deliveries.
made snapshot 6bbb1e7b5c589db9aa72a0ffc6b4bc96e1e34f73b2f4f03bf3b10f6ed32e5bc9 after delivery 89002 to worker PID 7021.
    Save time = 2.711s. Delivery time since last snapshot 85.655s. Up 5000 deliveries.
made snapshot 6bbb1e7b5c589db9aa72a0ffc6b4bc96e1e34f73b2f4f03bf3b10f6ed32e5bc9 after delivery 89002 to worker PID 11276.
    Save time = 3.223s. Delivery time since last snapshot 86.102s. Up 4000 deliveries.
made snapshot 6bbb1e7b5c589db9aa72a0ffc6b4bc96e1e34f73b2f4f03bf3b10f6ed32e5bc9 after delivery 89002 to worker PID 18330.
    Save time = 2.733s. Delivery time since last snapshot 90.439s. Up 3000 deliveries.
made snapshot 6bbb1e7b5c589db9aa72a0ffc6b4bc96e1e34f73b2f4f03bf3b10f6ed32e5bc9 after delivery 89002 to worker PID 28662.
    Save time = 2.15s. Delivery time since last snapshot 84.705s. Up 2000 deliveries.
made snapshot 6bbb1e7b5c589db9aa72a0ffc6b4bc96e1e34f73b2f4f03bf3b10f6ed32e5bc9 after delivery 89002 to worker PID 29895.
    Save time = 2.541s. Delivery time since last snapshot 86.967s. Up 1000 deliveries.
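The log above shows the per-snapshot delivery time growing with worker age: roughly 85s for freshly restarted workers, up to about 134s for the workers that have been up for all 89,002 deliveries. A minimal sketch (assuming the log format shown above) for extracting those numbers so they can be correlated with worker age:

```javascript
// Parse one of the replay tool's snapshot log lines into its three
// numeric fields: snapshot save time, delivery time since the last
// snapshot, and total deliveries this worker has been up for.
const sample =
  'Save time = 3.647s. Delivery time since last snapshot 133.835s. Up 89002 deliveries.';

function parseSnapshotLine(text) {
  const m = text.match(
    /Save time = ([\d.]+)s\. Delivery time since last snapshot ([\d.]+)s\. Up (\d+) deliveries\./,
  );
  return (
    m && {
      saveTime: Number(m[1]),
      deliveryTime: Number(m[2]),
      deliveries: Number(m[3]),
    }
  );
}

console.log(parseSnapshotLine(sample));
// → { saveTime: 3.647, deliveryTime: 133.835, deliveries: 89002 }
```

Plotting `deliveryTime` against `deliveries` across workers makes the age-correlated slowdown visible at a glance.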
@mhofman mhofman added bug Something isn't working xsnap the XS execution tool labels Dec 11, 2022
@warner
Member

warner commented Dec 13, 2022

My only hypothesis on this so far is some kind of memory fragmentation, but if the memory footprint (VmSize/RSS) is growing too, then maybe the list of active slots has accumulated a bunch of useless/dummy entries, or something similar.

@warner
Member

warner commented Dec 13, 2022

Just so I don't forget: CRIU (e.g. https://criu.org/Simple_loop) is a tool that performs userspace snapshot/restore of Linux processes, and might conceivably be useful for building faster-to-restart reproduction tools for this: snapshot an xsnap that has been "aged" into a slow-behavior state, then restore a separate copy for each experiment, rather than needing to re-age an xsnap for every experiment you want to run.

@mhofman
Member Author

mhofman commented Feb 8, 2023

We believe some of the slowdown is visible in the spreadsheets behind the investigation of #6786.

@mhofman
Member Author

mhofman commented Feb 10, 2023

A re-run of all vat transcripts, both with no restart and with a forced restart after every snapshot (executing only a single worker concurrently in both cases), shows that vat17 and vat18 are the most impacted, with a roughly 200% slowdown. vat6 experiences a much smaller but still noticeable slowdown (25%), while vat7, which has a similar amount of total execution time, experienced no slowdown at all. vat1 experienced a small 6% slowdown. Other vats had no meaningful execution time and were all consistent.

| vatID | Avg execution time (no restart) | Avg execution time (forced restart) |
| ----- | ------------------------------- | ----------------------------------- |
| v18   | 80.70                           | 25.00                               |
| v17   | 53.86                           | 18.40                               |
| v6    | 11.79                           | 9.53                                |
| v7    | 6.44                            | 6.44                                |
| v1    | 2.52                            | 2.38                                |

(Execution time in seconds of 1000 deliveries between snapshots)
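The percentages quoted above follow directly from the table; a minimal sanity-check sketch:

```javascript
// Slowdown per vat, computed from the table above:
// slowdown% = (noRestart / forcedRestart - 1) * 100
const table = {
  v18: [80.70, 25.00],
  v17: [53.86, 18.40],
  v6: [11.79, 9.53],
  v7: [6.44, 6.44],
  v1: [2.52, 2.38],
};

const slowdownPct = ([noRestart, forcedRestart]) =>
  Math.round((noRestart / forcedRestart - 1) * 100);

for (const [vatID, times] of Object.entries(table)) {
  console.log(`${vatID}: ${slowdownPct(times)}% slowdown`);
}
// v18 comes out to 223%, v17 to 193% (hence "roughly 200%"),
// v6 to 24%, v7 to 0%, and v1 to 6%.
```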

Graph of forced restart execution times per vat:
[graph image]

Graph of no restart execution times per vat:
[graph image]

@warner
Member

warner commented Feb 13, 2023

cc @raphdev
