Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(SwingSet): move promise rejection from stopVat to the kernel #7130

Merged
merged 25 commits into from
Mar 17, 2023

Conversation

gibson042
Copy link
Member

@gibson042 gibson042 commented Mar 7, 2023

closes: #6694

Description

  • Reject orphaned promises in terminateVat
  • Reject orphaned promises in processUpgradeVat
  • Remove special-casing in durable promise watchers

Security Considerations

This is an overall increase in robustness against problems with dying vats.

Scaling Considerations

I don't believe there are any changes w.r.t. scaling.

Documentation Considerations

All changes would be internal, but we should still do a scan before merging.

Testing Considerations

No changes.

@gibson042 gibson042 requested a review from warner March 7, 2023 23:48
@gibson042 gibson042 marked this pull request as draft March 7, 2023 23:48
@datadog-full-agoric
Copy link

datadog-full-agoric bot commented Mar 7, 2023

Datadog Report

Branch report: 6694-kernel-disconnect-promises
Commit report: a0ac542

agoric-sdk: 0 Failed, 0 New Flaky, 3664 Passed, 55 Skipped, 15m 28.59s Wall Time

@gibson042 gibson042 force-pushed the 6694-kernel-disconnect-promises branch from d87c370 to 90e156c Compare March 8, 2023 00:00
@gibson042 gibson042 marked this pull request as ready for review March 9, 2023 02:23
Copy link
Member

@warner warner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! We'll need to move that new test into swingset/, remove the new dependencies, and write some new narrow-style tests to live in liveslots/ . Grab me tomorrow and we can pair on those tests if you like.

@@ -2,85 +2,88 @@

Kernels and vats communicate about promises by referring to their VPIDs: vat(-centric) promise IDs. These are strings like `p+12` and `p-23`. Like VOIDs (object IDs), the plus/minus sign indicates which side of the boundary allocated the number (`p+12` and `o+12` are allocated by the vat, `p-13` and `o-13` are allocated by the kernel). But where the object ID sign also indicates which side "owns" the object (i.e. where the behavior lives), the promise ID sign is generally irrelevant.

Instead, we care about which side holds the resolution authority. This is not indicated by the VPID sign, and in fact is not necessarily static. Liveslots does not currently have any mechanism to allow one promise to be forwarded to another, but if it acquires this some day, then the decider of a promise could shift from kernel to vat to kernel again before it finally gets resolved. And there *is* a sequence that allows a vat to receive a reference to a promise in method arguments first, then receive a message whose result promise uses that VPID second (`p1 = p2~.foo(); x~.bar(p1)`, then someone resolves `p2` to `x`). In this case, the decider is initially the kernel, then the authority is transferred to the vat later.
Instead, we care about which side holds the resolution authority for a promise (referred to as being its **decider**). This is not indicated by the VPID sign, and in fact is not necessarily static. Liveslots does not currently have any mechanism to allow one promise to be forwarded to another, but if it acquires this some day, then the decider of a promise could shift from kernel to vat to kernel again before it finally gets resolved. And there *are* sequences that allow a vat to receive a reference to a promise in method arguments before receiving a message for which that same VPID identifies the result promise (e.g., `pk = makePromiseKit(); p2 = E(pk.promise).push('queued'); await E(observer).push(p2); pk.resolve(observer)` with an observer whose `push` returns a prompt response). In such cases, the decider is initially the kernel but later becomes the receiving vat.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fully agree with updating the example to E() instead of tildot, and your rewording is a lot more clear. I'm not sure I understand the new example.

OTOH I'm not sure I understand my original either. Let's see:

  • p2 is decided by the kernel
  • p1 is also decided by the kernel, at least until p2 is resolved
  • x is an object in some vat
  • bar is sent into that vat, carrying p1 as an argument, so the vat knows that p1 will be decided by the kernel, later
  • p2 gets resolved to x, allowing the foo() to be unqueued and delivered to the vat that hosts x
  • the dispatch.deliver() of foo causes its result promise (p1) to change decidership from the kernel to the vat that hosts x
  • none of this requires pipelining or promise forwarding

Let's see, in your new example:

  • pk and p2 are local to the sending vat
  • E(observer).push(p2) to a remote vat causes p2 to be exported, so the kernel knows the sending vat is the decider of p2
  • by "push returns a prompt response" I'm guessing there's a crank boundary between the push(p2) and the pk.resolve()?
  • pk.resolve(observer) resolves the local promise to the remote Presence, allowing the push("queued") to be sent out into the kernel towards "observer"
  • the remote vat receives push("queued")

I don't think I see which promise is changing decider from kernel to receiving vat.

Copy link
Member Author

@gibson042 gibson042 Mar 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's my understanding of how decidership of p2 changes; please let me know if I've got anything wrong.

step1: pk = makePromiseKit();

  1. pk.promise is local to Alice

step2: p2 = E(pk.promise).push('queued');

  1. syscall.send(target=pk.promise, method='push', args=['queued'], result=p2)
  2. Kernel becomes the decider of p2 and notes that Alice is the decider of pk.promise

step3: await E(observer).push(p2);

  1. syscall.send(target=observer, method='push', args=[p2], result=<anonymous>)
  2. Kernel becomes the decider of <anonymous> and notes subscription by Alice
  3. Alice loses agency

step3a

  1. Kernel handles the step2 syscall.send by pushing onto the queue for pk.promise

step3b

  1. Kernel handles the step3 syscall.send by deliver(type='send', target=observer, method='push', args=[p2], result=<anonymous>) and making observer's vat the decider of <anonymous>
  2. observer does syscall.resolve(subject=<anonymous>, resolution=…)
  3. observer loses agency

step3c

  1. Kernel handles the syscall.resolve, which involves notify(vatID=Alice, promise=<anonymous>) to enqueue a message

step3d

  1. Kernel handles the notify by deliver(type='notify', subject=<anonymous>, resolution=…)
  2. Alice regains agency

step4: pk.resolve(observer);

  1. syscall.resolve(subject=pk.promise, resolution=observer)
  2. Alice loses agency

step4a

  1. Kernel handles the step4 syscall.resolve, which involves noting resolution of pk.promise to observer and requeueKernelPromise(pk.promise) to requeue messages to it

step4b

  1. Kernel handles the requeued step2 syscall.send by deliver(type='send', target=observer, method='push', args=['queued'], result=p2) and making observer's vat the decider of p2 (which it previously only knew from delivery arguments) and [hopefully!] also noting subscription by Alice to p2

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

step2: p2 = E(pk.promise).push('queued');
syscall.send(target=pk.promise, method='push', args=['queued'], result=p2)

Ah, ok since pk.promise is local, this E() is handled entirely by HandledPromise, and liveslots doesn't see it at all, until/unless pk.promise gets resolved to something that liveslots created.

(I'm betting that invalidates the rest of the sequence, so I didn't study it carefully yet, we can sync up on this in realtime and work out an example that demonstrates the problem, or maybe remove it from the docs.. I'm pretty sure I wrote the original to remind myself in the future, because it wasn't easy to reconstruct the problematic scenario, so it'd be nice to include at least a few hints)

packages/swingset-liveslots/src/vpid-tracking.md Outdated Show resolved Hide resolved
packages/swingset-liveslots/src/vpid-tracking.md Outdated Show resolved Hide resolved
packages/swingset-liveslots/src/vpid-tracking.md Outdated Show resolved Hide resolved
packages/swingset-liveslots/src/vpid-tracking.md Outdated Show resolved Hide resolved
@@ -1183,7 +1191,6 @@ function build(
// upgrade, or if we acquire decider authority for a
// previously-imported promise
if (pRec) {
// TODO: insist that we do not have decider authority for promiseID
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm, I'm not 100% sure this goes away.

Let's see, if we resolve a watched local promise, currently the kernel won't send the notify back into us, because our syscall.resolve will delete the vpid from our c-list, which means the kernel's notify event will find the c-list entry missing, so the kernel can't deliver a dispatch.notify (it has no vpid to translate it into). So we won't observe that case.

(I think we would tolerate it anyways, if liveslots receives a notify for a promise whose vpid it doesn't recognize, it will ignore it. But I also think that case won't happen, at least not for these cancelled-by-upgrade promises)

If we watch a remotely-decided promise, the dispatch.notify will tell us about a promise that we aren't the decider for, so the assertion implied by the comment would be satisfied.

If we're a new incarnation, and we just finished walking our saved WatchedPromises table, and regenerated new Promise objects for everything there, we can pretend that the kernel is the decider for all of them. Some of them were decided by our predecessor, but we don't know that, and the kernel effectively claimed decidership when it shut down our predecessor.

Maybe just change it to "it is unlikely that we have decider authority for these: do some case analysis, and then maybe add an assertion".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does liveslots track decidership at all? It looks to me like this block handles a notify for any watched promise in the imported set regardless of how it got there, and that being in the imported set is a solid basis for proceeding.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, you're probably right. Our comment near the importedVPIDs declaration claims that it only holds VPIDs for which the kernel is the decider.. if that's actually true, then this comment is stale (and a tautology).

I remember doing a lot of refactoring on this about a year ago, and maybe I introduced importedVPIDs/exportedVPIDs at that time, to clean up liveslot's approach to keeping track of decidership. I suspect it still isn't precisely correct (in the face of forwardings), but it's clearly trying to be.

So yeah, go ahead and delete that line. There's no better criteria for decidership than membership/not in importedVPIDs, so if there's any cleanup or other cases to inspect, it won't be in this function, it'll be in the dispatch.deliver that reacts to a previously-known result= VPID, or in the code that does syscall.send.

@@ -1424,7 +1431,7 @@ function build(
Fail`buildRootObject() for vat ${forVatID} returned ${rootObject} with no interface`;
// Need to load watched promises *after* buildRootObject() so that handler kindIDs
// have a chance to be reassociated with their handlers.
watchedPromiseManager.loadWatchedPromiseTable();
watchedPromiseManager.loadWatchedPromiseTable(unmeteredRevivePromise);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's our rationale for using unmetered revivePromise? I'm sure there might be one, but each time this comes up I have to think carefully about what it might or might not be sensitive to, and I'd like to have a clear reason for each place where we disable metering. I think it's usually a question of whether the function's behavior might depend upon the WeakRefs in slotToVal. If loadWatchedPromiseTable is strictly always adding things to that table (there's no chance for a vpid to already exist there), then I suspect we don't need/want it to be unmetered.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This preserves existing behavior, but I would like to leave enough breadcrumbs to avoid such doubt in the future.

The first line of revivePromise asserts absence of metering, and then the rest proceeds to call pRec = makePipelinablePromise(vpid), store the vpid–pRec mapping in importedVPIDs, register the vpid–promise mapping for the marshaller with registerValue, and return the promise.
revivePromise is called only by loadWatchedPromiseTable, which initializes an entry for the vpid–promise pair in promiseRegistrations and invokes pseudoThen to register settlement callbacks.

What commentary would you like to see at this invocation of loadWatchedPromiseTable?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, and yeah we don't have the time to dive any deeper than what you've just mapped out. Maybe just add "TODO: why does this need to be unmetered?", and we'll eventually do another pass through and revisit those decisions.

packages/swingset-liveslots/package.json Outdated Show resolved Hide resolved
packages/swingset-liveslots/test/test-vat-upgrade.js Outdated Show resolved Hide resolved
@gibson042 gibson042 requested a review from warner March 14, 2023 01:27
@datadog-full-agoric
Copy link

datadog-full-agoric bot commented Mar 16, 2023

Datadog Report

Branch report: 6694-kernel-disconnect-promises
Commit report: 3c7d40b

agoric-sdk: 0 Failed, 0 New Flaky, 3744 Passed, 57 Skipped, 18m 51.48s Wall Time

Copy link
Member

@warner warner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mostly good, but let's see if we can remove that new vat-data dependency

packages/swingset-liveslots/src/vpid-tracking.md Outdated Show resolved Hide resolved
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any problems with this refactoring, but if you adopt my request to avoid a liveslots->vat-data dependency, maybe it could go into a separate PR?

import { M } from '@agoric/store';
import { makePromiseKit } from '@endo/promise-kit';
// import { makeStoreUtils } from '../../vat-data/src/vat-data-bindings.js';
import { makeExoUtils } from '../../vat-data/src/exo-utils.js';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This import feels problematic. At the very least, let's change it to @agoric/vat-data/src/exo-util.js (using ../.. is kinda cheating / assuming the monorepo layout, and I'm likely to be moving swingset-liveslots out to a separate repository, which will break ../../)

At a deeper level, Exos are not a liveslots concern. Liveslots provides Kinds, and userspace can use @agoric/vat-data to get Exos on top of that. We need to make sure that liveslot's defineKind works when used by makeExo stuff, but I think that should be the responsibility of vat-data rather than liveslots.

And I'm worried about the coupling this induces (especially because of my upcoming #6596 work): currently swingset-liveslots has a devDependency upon vat-data solely to get the types for DefineKindOptions and DurableKindHandle, and I think those types should be exported by liveslots rather than having liveslots import them from vat-data. With that change, we could shed the liveslots -> vat-data devDependency entirely, which is going to help with the #6596 packaging effort (the build process for xsnap-supervisor, which imports liveslots, can happen in a workspace/repo/node_modules that doesn't have swingset at all, which would prune a cycle).

Could you have this test work with a Kind instead of an Exo, and remove the dependency on vat-data?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh, I thought I had already done that. Fixed.

@gibson042 gibson042 requested a review from warner March 16, 2023 20:19
Copy link
Member

@warner warner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok everything looks good except for the vpid-tracking.md line 5 "vpid forwarding" example. Maybe just leave that line the way it was (tildot and all), and we come back to it later for a more complete analysis? I'm pretty sure the original wording expressed whatever weird case it was that I thought of years ago (even if I don't remember how that case worked anymore), and I'm not convinced that the new wording accomplishes that, and I don't want to hold this up just for the sake of updating the example.

@warner warner added the SwingSet package: SwingSet label Mar 16, 2023
warner and others added 15 commits March 16, 2023 19:02
This makes room inside processUpgradeVat() for the new cleanup steps
that we want to add.

refs #6650
refs #7001
refs #6694
refs #6696
cleanupAfterTerminatedVat() had a loop which identified all the
promises (kpids) which were decided by a vat, so it could reject all
of them as the vat was being terminated. We will have a need for this
list of kpids during vat upgrade too, so refactor it out into a
separate kernelKeeper method, and change terminateVat() to fetch the
list before calling cleanupAfterTerminatedVat().
When upgrading a vat, we must reject all promises that were decided by
the old incarnation, because none of them are durable (the new
incarnation will be unable to resolve them). Previously, the vat's
stopVat() delivery would do the rejection. This moves the
responsibility to the kernel, to happen in the pause between the old
version being shut down and the new version being launched.

closes #6694
Increases clarity and convenience and sets up for future changes.
@gibson042 gibson042 force-pushed the 6694-kernel-disconnect-promises branch from 062317f to 8dab80f Compare March 16, 2023 23:05
@gibson042 gibson042 added the automerge:no-update (expert!) Automatically merge without updates label Mar 16, 2023
@mergify mergify bot merged commit 9d47dd0 into master Mar 17, 2023
@mergify mergify bot deleted the 6694-kernel-disconnect-promises branch March 17, 2023 00:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
automerge:no-update (expert!) Automatically merge without updates SwingSet package: SwingSet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

have kernel reject unresolved vat-decided promises of an upgraded vat
2 participants