syscall.retireImports translation error (when deleting WeakSet) #9939

warner · 2024-08-22T01:32:29Z

Michael observed an intermittent CI failure with:

----- RemoteChainFacade.5  2 making a RemoteChainFacade
error during syscall translation: (Error#7)
Error#7: syscall.retireImports but o-54 is still reachable
    at eval (.../swingset-vat/src/kernel/vatTranslator.js:437:7)

I was able to reproduce it locally (and capture slogfiles) by re-running yarn test test/orchestration/restart-contracts.test.ts in packages/boot on that PR's branch a dozen times.

We're still analyzing the slogs, but it looks like liveslots misbehaved, and did a retireImports without doing a dropImports first.

@mhofman suspects that our use of WeakRef probing (rather than relying upon the FinalizationRegistry's notification) is causing problems under V8 which we wouldn't see under XS, and this test is using local workers, not xsnap.

The text was updated successfully, but these errors were encountered:

warner · 2024-08-22T01:42:08Z

I captured two different slogfiles of non-crashing runs, and they show different GC behavior, which explains why the failure is intermittent: V8 is being weird about GC (our old friend #3240), and some runs see different WeakRef probes or finalization notifications than others. But liveslots is supposed to be tolerant of that, as long as the engine is consistent, so this points to either a bug in liveslots, or some interesting failure mode of the V8 GC sensors and how liveslots is using them.

mhofman · 2024-08-22T01:46:16Z

Possibly related:

michaelfig · 2024-08-22T03:51:37Z

I saw the same failure in CI under Node.js v18, ruling out that v20 is the only culprit:

----- RemoteChainFacade.5  2 making a RemoteChainFacade
error during syscall translation: (Error#7)
Error#7: syscall.retireImports but o-54 is still reachable
    at eval (.../swingset-vat/src/kernel/vatTranslator.js:437:7)
    at Array.map (<anonymous>)
    at translateRetireImports (.../swingset-vat/src/kernel/vatTranslator.js:431:19)
    at Object.vatSyscallToKernelSyscall (.../swingset-vat/src/kernel/vatTranslator.js:620:8)
    at vatSyscallHandler (.../swingset-vat/src/kernel/kernel.js:1563:17)
    at syscallHandler (.../swingset-vat/src/kernel/vat-warehouse.js:42:12)
    at syscallFromWorker (.../swingset-vat/src/kernel/vat-loader/manager-helper.js:164:11)
    at doSyscall (.../swingset-vat/src/supervisors/supervisor-helper.js:72:3)
    at Object.retireImports (.../swingset-vat/src/supervisors/supervisor-helper.js:104:24)
    at scanForDeadObjects (.../swingset-liveslots/src/liveslots.js:338:9)
    at async bringOutYourDead (.../swingset-liveslots/src/liveslots.js:1549:1)
    at async Object.deliver (.../swingset-vat/src/kernel/vat-loader/manager-helper.js:146:14)
    at async Object.deliverToVat (.../swingset-vat/src/kernel/vat-warehouse.js:537:22)
    at async deliverAndLogToVat (.../swingset-vat/src/kernel/kernel.js:425:22)
    at async processBringOutYourDead (.../swingset-vat/src/kernel/kernel.js:666:14)
    at async deliverRunQueueEvent (.../swingset-vat/src/kernel/kernel.js:1298:15)
    at async processDeliveryMessage (.../swingset-vat/src/kernel/kernel.js:1337:20)
    at async tryProcessMessage (.../swingset-vat/src/kernel/kernel.js:1438:14)
    at async Object.run (.../swingset-vat/src/kernel/kernel.js:1964:19)

error during vat dispatch() of bringOutYourDead (Error#8)
Error#8: syscall.retireImports failed: syscall translation error: prepare to die
    at doSyscall (.../swingset-vat/src/supervisors/supervisor-helper.js:87:7)
    at Object.retireImports (.../swingset-vat/src/supervisors/supervisor-helper.js:104:24)
    at scanForDeadObjects (.../swingset-liveslots/src/liveslots.js:338:9)
    at async bringOutYourDead (.../swingset-liveslots/src/liveslots.js:1549:1)
    at async Object.deliver (.../swingset-vat/src/kernel/vat-loader/manager-helper.js:146:14)
    at async Object.deliverToVat (.../swingset-vat/src/kernel/vat-warehouse.js:537:22)
    at async deliverAndLogToVat (.../swingset-vat/src/kernel/kernel.js:425:22)
    at async processBringOutYourDead (.../swingset-vat/src/kernel/kernel.js:666:14)
    at async deliverRunQueueEvent (.../swingset-vat/src/kernel/kernel.js:1298:15)
    at async processDeliveryMessage (.../swingset-vat/src/kernel/kernel.js:1337:20)
    at async tryProcessMessage (.../swingset-vat/src/kernel/kernel.js:1438:14)
    at async Object.run (.../swingset-vat/src/kernel/kernel.js:1964:19)

kernel terminating vat v15 (failure=true)
##### KERNEL PANIC: critical vat v15 failed #####
controller.run() failure (Error#9)
Error#9: #{"#error":"syscall translation error: prepare to die","name":"Error"}
    at terminateVat (.../swingset-vat/src/kernel/kernel.js:304:39)
    at processDeliveryMessage (.../swingset-vat/src/kernel/kernel.js:1401:7)
    at async tryProcessMessage (.../swingset-vat/src/kernel/kernel.js:1438:14)
    at async Object.run (.../swingset-vat/src/kernel/kernel.js:1964:19)

not ok 12 - orchestration › restart-contracts › basicFlows %ava-dur=17379ms
#   start basicFlows
#   making offer
#   REJECTED from ava test.serial("basicFlows"): (Error#9)
#   Error#9: #{"#error":"syscall translation error: prepare to die","name":"Error"}
#       at terminateVat (.../swingset-vat/src/kernel/kernel.js:304:39)
#       at processDeliveryMessage (.../swingset-vat/src/kernel/kernel.js:1401:7)
#       at async tryProcessMessage (.../swingset-vat/src/kernel/kernel.js:1438:14)
#       at async Object.run (.../swingset-vat/src/kernel/kernel.js:1964:19)

  ---
    name: AssertionError
    message: Rejected promise returned by test
    values:
      'Rejected promise returned by test. Reason:': |-
        Error {
          message: '#{"#error":"syscall translation error: prepare to die","name":"Error"}',
        }
    at: |-
      terminateVat (.../swingset-vat/src/kernel/kernel.js:304:39)
      processDeliveryMessage (.../swingset-vat/src/kernel/kernel.js:1401:7)
      async tryProcessMessage (.../swingset-vat/src/kernel/kernel.js:1438:14)
      async Object.run (.../swingset-vat/src/kernel/kernel.js:1964:19)
  ...

warner · 2024-08-22T08:17:13Z

ok I think I see a pathway by which a voAwareWeakMap getting deleted could put us in a situation where a Presence vref (o-54) is in the possiblyRetiredSet despite there still being a vdata pillar.

The slog trace of the failing run shows v15 doing a BOYD which does three passes through the scanForDeadObjects loop. On the third pass, we do a recognizer check for o-54 (ir.o-54) without first doing a refcount check, which I think is the possiblyRetiredSet scan (the second of the three clauses). It didn't show any previous recognizer checks.

For o-54 to have gotten into possiblyRetiredSet after the second pass cleared the set, but before the third pass did the check, would require the droppedCollectionRegistry callback to fire during the gcAndFinalize()at the top of the third pass. That would call finalizeDroppedCollection(), which would be routed to voAwareWeakMapDelete(), which would walk all keys of the weakmap and add their vrefs to possiblyRetiredSet.

The scenario I'm visualizing is where the vat has a voAwareWeakMap, this o-54 Presence is added as a key, then userspace drops the weakmap. On the next BOYD, the first two passes do their gcAndFinalize(), but for some weird V8 reason, that's not enough to shake the thing loose. Only on the third pass does it finally get collected, and the finalizer runs. The third pass' possiblyRetiredSet scan notices that o-54 indeed lacks any recognizers (neither in-RAM weakmap/weakset, nor durable ir.o-54 keys), and adds it to importsToRetire, despite the fact that o-54 is still reachable (from the o+d11/19 VOM state).

If the first pass was able to shake it loose, I'd expect finalizeDroppedCollection to be called during the first gcAndFinalize(), which would cause a recognizer check in the first pass, just before we start deleting the first batch of VOMs. It would still be buggy (we'd be retiring something without first dropping it), but the syscall trace would be different than what I captured from the failing run.

So maybe this condition could be reached without V8 being weird, which raises the concerning possibility that it could be reached from XS. It would require a Presence to be reachable from virtual data and also recognizable from a WeakMap or WeakSet, but not have a RAM pillar. The RAM pillar should be dropped, then a BOYD should be done. Then the WeakMap should be deleted, then we do a second BOYD. (If a single BOYD both processed the RAM-pillar drop and the possiblyRetiredSet scan, then the !deadSet.has(vref) clause would skip the isVrefRecognizable() and the retirement).

I think the fix will be for the possiblyRetiredSet scan to include a reachability check: if the vref is reachable, it must not be added to importsToRetire.

I don't understand the !deadSet.has(vref) clause in that loop. My vague memory is that it exists to avoid duplicating work, but it smells like it's also meant to guard against the not-dead-yet case.

warner · 2024-08-22T16:33:53Z

I was able to write a unit test that triggers this.. defintely a liveslots bug.

refs #9939

warner · 2024-08-22T18:12:22Z

Ok I think the requirements to trigger the bug are:

a Presence is used as the key of a WeakMap/WeakSet, and also held in a virtual/durable object state (or collection)
the Presence is dropped, and a BOYD collects it
then, the WeakMap/Set is deleted
then, a second BOYD notices the recognizable decref from the deleted weak collection, and emits the retireImports (which is bogus because the vref is still reachable, from vdata, which is why the vat did not previously emit a dropImports)

I don't yet know if using the vref in a (virtual/durable) WeakStore would trigger the problem.

@michaelfig tells me that @agoric/vow uses short-lived WeakSets to detect resolution cycles, and the new (as of upgrade16) vat-network uses @agoric/vow extensively. That probably explains why we didn't see this before.

I think it might also be exacerbated by the reapDirtThreshold.gcKrefs that I landed on trunk last week (for the benefit of slow-vat-deletion). That probably causes BOYDs to happen faster than before, which might increase the chance that the Presence is collected in a different BOYD than the WeakSet being collected.

mhofman · 2024-08-22T19:00:39Z

More specifically, Vow watch uses a WeakMap of WeakSet, keyed on a promise watcher (an exo object so virtualized) with the WeakSet containing presences or representiatives of vows.

The "then, the WeakMap/Set is deleted" would be caused by liveslots collecting the watcher in the first BOYD, which would cause the WeakSet to subsequently become collectible by the engine.

agoric-sdk/packages/vow/src/watch.js

Lines 69 to 82 in b84a426

    
           // We use an ephemeral WeakSet for the previously seen vows in a watch operation 
        
           // While watch is durable, it suffices to detect the cycle in a single incarnation 
        
           /** @type {WeakMap<PromiseWatcher, WeakSet<any>>} */ 
        
           const watcherSeenPayloads = new WeakMap(); 
        
           /** @param {PromiseWatcher} watcher */ 
        
           const getSeenPayloads = watcher => { 
        
             let seenPayloads = watcherSeenPayloads.get(watcher); 
        
             if (!seenPayloads) { 
        
               seenPayloads = new WeakSet(); 
        
               watcherSeenPayloads.set(watcher, seenPayloads); 
        
             } 
        
             return seenPayloads; 
        
           };

warner · 2024-08-22T20:41:03Z

So the way I want to fix this is to make the liveslots scanForDeadObjects loop be safely idempotent: it should be safe, just horribly inefficient, to put every vref (or at least every type: object vref, so every void) into both possiblyDeadSet and possiblyRetiredSet, and have scanForDeadObjects recognize the ones that aren't actually dead or retired. Part of the required behavior is that we only retire something if it was already dropped (or marked for drop by the end of the BOYD).

erights · 2024-08-22T23:23:53Z

For these short-lived WeakSets and/or WeakMaps as used by vows, what are the keys? What are the expected lifetimes of those keys? Do we expect such keys to have been used as keys in many short-lived WeakSets/WeakMaps?

I ask because that falls into the one case where XS weak gc is expensive, and is expected to remain expensive. Attn @phoddie @patrick-soquet

In general, when we know the WeakSet/WeakMap is expected to have a shorter lifetime than its keys, and its keys will be reused across several such WeakSets/WeakMaps, then we should ask if we should just use a short-lived Set/Map instead. If we can, we should. That's why I am relaxed about the remaining expense of that one case.

warner · 2024-08-22T23:55:42Z

So the way I want to fix this is to make the liveslots scanForDeadObjects loop be safely idempotent

I realized I can't do that, at least not for imports (Presences).

If we'd included an additional "import status" key (e.g. vNN.vc.vom.is.o-123), where we track the vref's state relative to the kernel (so reachable vs merely-recognizable vs nothing), then we could compare "has RAM pillar" (slotToVal.get(vref)) OR "has vdata pillar" (vom.rc.${vref}) against the import status. If the vref was merely-recognizable or missing entirely, but the import status says "reachable", then we emit a syscall.dropImports and change the import status to "merely-recognizable". If the pillars are missing and we have no recognizers for the vref, but the import status is "merely-recognizable", then we emit a syscall.retireImports and delete the import status.

We can't really retrofit that on now, and it would cost an extra DB key per vref so I'm not positive we'd be willing to pay that cost anyways. So we've got a constraint to manage: you must not add a vref to possiblyDeadSet that is not currently reachable. When we arrive in scanForDeadObjects(), we can rely upon every vref in possiblyDeadSet having an (implicit) import status of "reachable", and if our reachability check says "no", then we can safely dropImports it without contradicting any earlier syscalls. Likewise, you must not add a vref to possiblyRetiredSet that is not at least recognizable, so a recognizable===false result can safely do a syscall.retireImports.

One other improvement we can make is to remove some of the duplicate refcount checking. For a virtual/durable object (whose RAM pillar is a Representative), the first phase does a .rc.${vref} and .es.${vref} check (to see if the object is really dead), and the third phase deletes the VOM (which starts with an assertion of unreachability, which does a second .rc and .es check).

So I'm trying to sketch out a clean way to organize scanForDeadObjects that will avoid some of the duplication and also lead to a cleaner code path. I'm wondering if I can process one vref at a time, instead of handling all of possiblyDeadSet in one phase, then all of possiblyRetiredSet in a second phase, then all of deadSet in the third phase. And I'd like to separate the codepaths for our three kinds of trackable objects (with RAM pillars of Remotables, Representatives, and Presences, which map to vref shapes of o+NN, o+dKK/NN, and o-NN).

It may help to cache the reachability/recognizability check, but we have to be careful about cache invalidation. We could share the check between the first and second phases, but the moment we delete a VOM, we might invalidate the refcounts.

mhofman · 2024-08-23T00:51:46Z

I ask because that falls into the one case where XS weak gc is expensive, and is expected to remain expensive.

These are liveslots virtualized WeakSets used most likely (but not always) with virtual or durable objects as entries, as such the GC profile of XS does not come into play, but instead it's the liveslots gc behavior that does.

mhofman · 2024-08-23T00:56:45Z

If we'd included an additional "import status" key (e.g. vNN.vc.vom.is.o-123), where we track the vref's state relative to the kernel (so reachable vs merely-recognizable vs nothing)

This sounds related to my suggested fix for #9338: track in durable storage the full status of presences and representatives.

refs #9939 WIP: more precise failing test

These tests will fail without the fixes in the next commit One new test is disabled because of additional pending bugs that interfere with the test setup (a combo of #9956 and #9959).

This rewrites scanForDeadObjects(), which is called during dispatch.bringOutYourDead to process possiblyDeadSet and possiblyRetiredSet. The new flow should be easier to review and understand. The main behavioral difference is to fix a bug (#9939) in which a vref that appears in possiblyRetiredSet (because e.g. a weak collection was deleted, which was using that vref as a key), but which 1: lacks a RAM pillar (Presence object) and 2: was not dropped in this BOYD (e.g. it has a vdata pillar), used to be sent to the kernel in a bogus `syscall.retireImports()` call. Because this vref was not previously dropped by the vat (syscall.dropImports()), this was a vat-fatal error. The new code will only retire such a Presence vref if it was not reachable by the vat. fixes #9939

michaelfig · 2024-08-27T01:36:51Z

@warner the above commit that landed on master (064ff1a) is the one I'd like you to revert while closing this issue (#9939).

These tests will fail without the fixes in the next commit One new test is disabled because of additional pending bugs that interfere with the test setup (a combo of #9956 and #9959), which will be re-enabled in PR #9961 (to address #8756 and others)

@mhofman

This rewrites scanForDeadObjects(), which is called during dispatch.bringOutYourDead to process possiblyDeadSet and possiblyRetiredSet. The new flow should be easier to review and understand. The main behavioral difference is to fix a bug (#9939) in which a vref that appears in possiblyRetiredSet (because e.g. a weak collection was deleted, which was using that vref as a key), but which 1: lacks a RAM pillar (Presence object) and 2: was not dropped in this BOYD (e.g. it has a vdata pillar), used to be sent to the kernel in a bogus `syscall.retireImports()` call. Because this vref was not previously dropped by the vat (syscall.dropImports()), this was a vat-fatal error. The new code will only retire such a Presence vref if it was not reachable by the vat. The new tests are marked as expected to pass again. thanks @mhofman and @gibson042 for recommendations fixes #9939

@michaelfig

This reverts commit 064ff1a. Now that the underlying issue is fixed, we can re-enable this formerly-flaky test. Thanks @michaelfig for your patience.

Rewrite scanForDeadObjects(), which is called during dispatch.bringOutYourDead to process possiblyDeadSet and possiblyRetiredSet. The new flow should be easier to review and understand. The main behavioral difference is to fix a bug (#9939) in which a vref that appears in possiblyRetiredSet (because e.g. a weak collection was deleted, which was using that vref as a key), but which 1: lacks a RAM pillar (Presence object) and 2: was not dropped in this BOYD (e.g. it has a vdata pillar), used to be sent to the kernel in a bogus `syscall.retireImports()` call. Because this vref was not previously dropped by the vat (syscall.dropImports()), this was a vat-fatal error. The new code will only retire such a Presence vref if it was not reachable by the vat. fixes #9939

warner added liveslots requires vat-upgrade to deploy changes SwingSet package: SwingSet labels Aug 22, 2024

warner added a commit that referenced this issue Aug 22, 2024

WIP: failing test

7276bc5

refs #9939

warner mentioned this issue Aug 22, 2024

rewrite scanForDeadObjects to avoid retire-without-drop #9942

Merged

warner self-assigned this Aug 22, 2024

warner changed the title ~~syscall.retireImports translation error (under v8)~~ syscall.retireImports translation error (when deleting WeakSet) Aug 22, 2024

warner mentioned this issue Aug 24, 2024

using Remotable as weak collection key causes value to be pinned forever #9956

Closed

warner added a commit that referenced this issue Aug 26, 2024

WIP: failing test

c374e45

refs #9939 WIP: more precise failing test

warner mentioned this issue Aug 26, 2024

deleting/clearing a virtual collection will leak all vref-backed keys #9959

Closed

michaelfig added a commit that referenced this issue Aug 26, 2024

test(boot): skip basicFlows test until #9939 is fixed

1d43cac

michaelfig added a commit that referenced this issue Aug 27, 2024

test(boot): skip basicFlows test until #9939 is fixed

064ff1a

warner mentioned this issue Aug 27, 2024

null-upgrade orch vats to get fixed liveslots #9978

Closed

kriskowal pushed a commit that referenced this issue Aug 27, 2024

test(boot): skip basicFlows test until #9939 is fixed

c759ed4

mergify bot closed this as completed in #9942 Aug 30, 2024

mergify bot pushed a commit that referenced this issue Aug 30, 2024

Revert "test(boot): skip basicFlows test until #9939 is fixed"

c045163

This reverts commit 064ff1a. Now that the underlying issue is fixed, we can re-enable this formerly-flaky test. Thanks @michaelfig for your patience.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

syscall.retireImports translation error (when deleting WeakSet) #9939

syscall.retireImports translation error (when deleting WeakSet) #9939

warner commented Aug 22, 2024 •

edited

Loading

warner commented Aug 22, 2024

mhofman commented Aug 22, 2024 •

edited

Loading

michaelfig commented Aug 22, 2024

warner commented Aug 22, 2024

warner commented Aug 22, 2024

warner commented Aug 22, 2024

mhofman commented Aug 22, 2024

warner commented Aug 22, 2024

erights commented Aug 22, 2024

warner commented Aug 22, 2024

mhofman commented Aug 23, 2024

mhofman commented Aug 23, 2024

michaelfig commented Aug 27, 2024

syscall.retireImports translation error (when deleting WeakSet) #9939

syscall.retireImports translation error (when deleting WeakSet) #9939

Comments

warner commented Aug 22, 2024 • edited Loading

warner commented Aug 22, 2024

mhofman commented Aug 22, 2024 • edited Loading

michaelfig commented Aug 22, 2024

warner commented Aug 22, 2024

warner commented Aug 22, 2024

warner commented Aug 22, 2024

mhofman commented Aug 22, 2024

warner commented Aug 22, 2024

erights commented Aug 22, 2024

warner commented Aug 22, 2024

mhofman commented Aug 23, 2024

mhofman commented Aug 23, 2024

michaelfig commented Aug 27, 2024

warner commented Aug 22, 2024 •

edited

Loading

mhofman commented Aug 22, 2024 •

edited

Loading