syscall.retireImports translation error (when deleting WeakSet) #9939
Comments
I captured two different slogfiles of non-crashing runs, and they show different GC behavior, which explains why the failure is intermittent: V8 is being weird about GC (our old friend #3240), and some runs see different WeakRef probes or finalization notifications than others. But liveslots is supposed to be tolerant of that, as long as the engine is consistent, so this points to either a bug in liveslots, or some interesting failure mode of the V8 GC sensors and how liveslots is using them.
I saw the same failure in CI under Node.js v18, ruling out that v20 is the only culprit:
ok I think I see a pathway by which a

The slog trace of the failing run shows v15 doing a BOYD which does three passes through the

For o-54 to have gotten into

The scenario I'm visualizing is where the vat has a voAwareWeakMap, this o-54 Presence is added as a key, then userspace drops the weakmap. On the next BOYD, the first two passes do their

If the first pass was able to shake it loose, I'd expect

So maybe this condition could be reached without V8 being weird, which raises the concerning possibility that it could be reached from XS. It would require a Presence to be reachable from virtual data and also recognizable from a WeakMap or WeakSet, but not have a RAM pillar. The RAM pillar should be dropped, then a BOYD should be done. Then the WeakMap should be deleted, then we do a second BOYD. (If a single BOYD both processed the RAM-pillar drop and the

I think the fix will be for the

I don't understand the
I was able to write a unit test that triggers this. Definitely a liveslots bug.
Ok I think the requirements to trigger the bug are:
I don't yet know if using the vref in a (virtual/durable) WeakStore would trigger the problem. @michaelfig tells me that I think it might also be exacerbated by the
More specifically, Vow

The "then, the WeakMap/Set is deleted" would be caused by liveslots collecting the watcher in the first BOYD, which would cause the WeakSet to subsequently become collectible by the engine.

agoric-sdk/packages/vow/src/watch.js, lines 69 to 82 in b84a426
So the way I want to fix this is to make the liveslots
For these short-lived WeakSets and/or WeakMaps as used by vows, what are the keys? What are the expected lifetimes of those keys? Do we expect such keys to have been used as keys in many short-lived WeakSets/WeakMaps?

I ask because that falls into the one case where XS weak gc is expensive, and is expected to remain expensive. Attn @phoddie @patrick-soquet

In general, when we know the WeakSet/WeakMap is expected to have a shorter lifetime than its keys, and its keys will be reused across several such WeakSets/WeakMaps, then we should ask if we should just use a short-lived Set/Map instead. If we can, we should. That's why I am relaxed about the remaining expense of that one case.
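A minimal sketch of that suggestion, assuming the collection only needs to live for one call and its keys outlive it. The function name `visitOnce` and its arguments are hypothetical and are not taken from the vow or liveslots code.

```javascript
// Hypothetical helper: visit each distinct item once.
// The Set holds its keys strongly, but it is local to this call, so it
// becomes garbage as a whole when the call returns. No weak collection
// (and no weak-GC bookkeeping for the keys) is needed.
function visitOnce(items, visit) {
  const seen = new Set();
  let visited = 0;
  for (const item of items) {
    if (seen.has(item)) {
      continue; // already handled during this call
    }
    seen.add(item);
    visit(item);
    visited += 1;
  }
  return visited;
}

const a = { id: 'a' };
const b = { id: 'b' };
const order = [];
const count = visitOnce([a, b, a], item => order.push(item.id));
console.log(count, order); // 2 [ 'a', 'b' ]
```

This only applies where a strong reference for the duration of the call is acceptable; it says nothing about the virtualized collections that liveslots provides.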
I realized I can't do that, at least not for imports (Presences). If we'd included an additional "import status" key (e.g.

We can't really retrofit that on now, and it would cost an extra DB key per vref so I'm not positive we'd be willing to pay that cost anyway.

So we've got a constraint to manage: you must not add a vref to

One other improvement we can make is to remove some of the duplicate refcount checking. For a virtual/durable object (whose RAM pillar is a Representative), the first phase does a

So I'm trying to sketch out a clean way to organize

It may help to cache the reachability/recognizability check, but we have to be careful about cache invalidation. We could share the check between the first and second phases, but the moment we delete a VOM, we might invalidate the refcounts.
These are liveslots virtualized WeakSets, most likely (but not always) used with virtual or durable objects as entries. As such, the GC profile of XS does not come into play; the liveslots GC behavior does.
This sounds related to my suggested fix for #9338: track in durable storage the full status of presences and representatives.
refs #9939 WIP: more precise failing test
This rewrites scanForDeadObjects(), which is called during dispatch.bringOutYourDead to process possiblyDeadSet and possiblyRetiredSet. The new flow should be easier to review and understand. The main behavioral difference is to fix a bug (#9939) in which a vref that appears in possiblyRetiredSet (because e.g. a weak collection was deleted, which was using that vref as a key), but which 1: lacks a RAM pillar (Presence object) and 2: was not dropped in this BOYD (e.g. it has a vdata pillar), used to be sent to the kernel in a bogus `syscall.retireImports()` call. Because this vref was not previously dropped by the vat (syscall.dropImports()), this was a vat-fatal error. The new code will only retire such a Presence vref if it was not reachable by the vat. The new tests are marked as expected to pass again. thanks @mhofman and @gibson042 for recommendations fixes #9939
This reverts commit 064ff1a. Now that the underlying issue is fixed, we can re-enable this formerly-flaky test. Thanks @michaelfig for your patience.
Rewrite scanForDeadObjects(), which is called during dispatch.bringOutYourDead to process possiblyDeadSet and possiblyRetiredSet. The new flow should be easier to review and understand. The main behavioral difference is to fix a bug (#9939) in which a vref that appears in possiblyRetiredSet (because e.g. a weak collection was deleted, which was using that vref as a key), but which 1: lacks a RAM pillar (Presence object) and 2: was not dropped in this BOYD (e.g. it has a vdata pillar), used to be sent to the kernel in a bogus `syscall.retireImports()` call. Because this vref was not previously dropped by the vat (syscall.dropImports()), this was a vat-fatal error. The new code will only retire such a Presence vref if it was not reachable by the vat. fixes #9939
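As a rough illustration of the rule in the commit message above, the following toy model decides which import-related syscalls a Presence vref may generate at the end of a BOYD. It is not the liveslots `scanForDeadObjects()` code: the function name `decideImportSyscalls`, the flag names, and the returned action strings are all hypothetical, chosen only to mirror the terms used in this thread (RAM pillar, vdata pillar, recognizable, dropped).

```javascript
// Toy model only; NOT the liveslots implementation. All names are hypothetical.
// A Presence vref is "reachable" while any pillar holds it up, and
// "recognizable" while some weak collection still uses it as a key.
function decideImportSyscalls({
  hasRamPillar, // a Presence object is still alive in RAM
  hasVdataPillar, // virtual/durable data still references the vref
  recognizable, // still a key of some WeakMap/WeakSet
  alreadyDropped, // a dropImports was already sent in an earlier BOYD
}) {
  const reachable = hasRamPillar || hasVdataPillar;
  const actions = [];
  if (reachable) {
    // Never retire a vref the vat can still reach, even if the last
    // weak collection that recognized it was just deleted.
    return actions;
  }
  if (!alreadyDropped) {
    actions.push('dropImports');
  }
  if (!recognizable) {
    // Only emitted for a vref that is dropped, now or previously.
    actions.push('retireImports');
  }
  return actions;
}

// The case from this issue: no RAM pillar, vdata pillar present,
// weak collection deleted. Nothing should be sent to the kernel.
const bugCase = decideImportSyscalls({
  hasRamPillar: false,
  hasVdataPillar: true,
  recognizable: false,
  alreadyDropped: false,
});
console.log(bugCase); // []
```

In this model a `retireImports` can never appear for a vref that is still reachable, which is the property the bogus syscall violated.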
Michael observed an intermittent CI failure with:
I was able to reproduce it locally (and capture slogfiles) by re-running `yarn test test/orchestration/restart-contracts.test.ts` in `packages/boot` on that PR's branch a dozen times.

We're still analyzing the slogs, but it looks like liveslots misbehaved, and did a `retireImports` without doing a `dropImports` first.

@mhofman suspects that our use of `WeakRef` probing (rather than relying upon the `FinalizationRegistry`'s notification) is causing problems under V8 which we wouldn't see under XS, and this test is using local workers, not xsnap.
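For readers unfamiliar with the two GC sensors mentioned here, the sketch below shows how they differ. It is a simplified illustration and not liveslots code; `track`, `probe`, `weakRefs`, and `possiblyDead` are hypothetical names. `WeakRef` and `FinalizationRegistry` are the standard JavaScript APIs.

```javascript
const weakRefs = new Map(); // vref string -> WeakRef to the tracked object
const possiblyDead = new Set(); // vrefs whose finalizer has fired

// Notification path: the engine calls this some time after collection,
// with no guarantee about when (or whether) it runs.
const registry = new FinalizationRegistry(vref => {
  possiblyDead.add(vref);
});

function track(vref, obj) {
  weakRefs.set(vref, new WeakRef(obj));
  registry.register(obj, vref);
}

// Probe path: ask right now whether the object is still alive.
function probe(vref) {
  const ref = weakRefs.get(vref);
  return ref !== undefined && ref.deref() !== undefined;
}

const presence = { name: 'example' }; // held strongly, so it cannot be collected
track('o-54', presence);
console.log(probe('o-54')); // true
console.log(probe('o-99')); // false (never tracked)
```

Because finalization callbacks are scheduled by the engine at a later time, a probe may report an object as gone before its notification arrives, so code that consults both sensors has to tolerate them disagreeing within a single pass.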