remove stopVat(): allow vat upgrade without participation of old vat version #6650
Comments
@FUDCo and I chatted about this a bit. @FUDCo mentioned that he's not sure we're getting a lot of value out of the LRU cache, compared to the confusion it creates when debugging, and that moving to a write-through cache seemed like a decent approach. I need to reread the code to remind myself of when exactly we benefit from the read-caching behavior (I know we have to do some operation on every userspace access, for determinism that is not sensitive to GC behavior), but Chip was pretty sure that we merely repeat the |
What is the Problem Being Solved?
Our current mechanism for upgrading vats requires two steps:
- stopVat() to the old version, giving it a chance to clean up and shut down
- startVat() to launch/resume everything under the new code
The reasons we built upgrade with a stopVat include:
- stopVat, at the cost of leaking some DB space
But we've talked about getting rid of the stopVat step, because:
- stopVat without error?
- stopVat; no user-space code is run, but it still raises the question of whether we should be doing some sort of metering
- stopVat, the vat is now un-upgradeable, and we're kinda stuck
While we haven't really committed to removing it, we'd like to.
To do that, we need to have the kernel take over responsibility for everything we currently delegate to stopVat().
The stopVat() call currently does:
- startVat() can notify the follower callbacks: agoric-sdk/packages/SwingSet/src/liveslots/liveslots.js (lines 1529 to 1532 in 162fdf2)
- disconnectObject: agoric-sdk/packages/SwingSet/src/liveslots/stop-vat.js (line 307 in 162fdf2)
- syscall.abandonExport() on their vrefs: agoric-sdk/packages/SwingSet/src/liveslots/stop-vat.js (lines 315 to 318 in 162fdf2)
- agoric-sdk/packages/SwingSet/src/liveslots/stop-vat.js (line 348 in 162fdf2)
Description of the Design
First, we need to decide what to do about the LRU cache. Either we make it a write-through cache (so it's never stale, and doesn't need flushing), remove it entirely, or require a BOYD delivery in place of the stopVat (to allow the cache to be flushed, and incidentally providing one last chance to decrement references, which may reduce the amount of durable data retained across the upgrade boundary). Requiring a BOYD has many of the same issues as requiring stopVat: if the vat is really confused, it might not work, and thus break the upgrade. OTOH, BOYD avoids userspace interaction just as much as stopVat does, so the confusion would have to be in liveslots or lower for it to cause a problem.
Doing a BOYD before upgrade does have the potential to reduce the amount of data being retained across the upgrade, which might make things faster (and/or leak less data, until we manage to implement a full mark/sweep of the vatstore). I'm undecided about how the simplicity of not doing it compares to the consequences.
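To make the write-through option concrete, here is a minimal sketch of the idea. Everything in it is hypothetical: makeWriteThroughCache and the Map standing in for the vatstore are invented for illustration, this is not the liveslots cache API, and it ignores the determinism constraints the real cache has to respect.

```javascript
// Hypothetical sketch: a write-through cache over a key/value backing store.
// `store` stands in for the vatstore; any object with get/set works here.
function makeWriteThroughCache(store) {
  const cache = new Map(); // unbounded, to keep the sketch short
  return {
    get(key) {
      if (!cache.has(key)) {
        cache.set(key, store.get(key)); // fill the cache on a read miss
      }
      return cache.get(key);
    },
    set(key, value) {
      cache.set(key, value);
      store.set(key, value); // every write reaches the store immediately
    },
  };
}

const backing = new Map();
const cache = makeWriteThroughCache(backing);
cache.set('k1', 'v1');
// The backing store is already current, so there is nothing to flush.
console.log(backing.get('k1') === cache.get('k1')); // true
```

Because writes never sit only in the cache, an upgrade could discard the cache at any point without a flush step, which is the property that matters for dropping stopVat.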
Then, we need to reject all the vat's old Promises. We already have code to locate these, in cleanupAfterTerminatedVat(), in agoric-sdk/packages/SwingSet/src/kernel/state/kernelKeeper.js (lines 855 to 856 in 162fdf2).
We have code in stopVat to record the list of these rejected vpids (in the vatstore, inside a durable collection named watchedPromises), and code in startVat to invoke the registered callbacks. If the kernel does the rejection, we'll need some way to inform the startVat about the vpids and the rejection data.
- stopVat used to
- startVat to acquire an additional argument, with the list of vat-decided non-durable vpids that we rejected during upgrade, so it can walk the saved registration table and issue the callbacks
- dispatch.notify() for all the rejected promises after the startVat
- syscall.subscribe() to inform the kernel about the own-promises too, and the kernel could limit the notifies/etc to the ones that were subscribed
Next, we need to abandon all the non-durable exported objects. Again, we have similar code already present in the vat-termination pathway,
agoric-sdk/packages/SwingSet/src/kernel/state/kernelKeeper.js (lines 830 to 842 in 162fdf2)
voids, just as promise vrefs are vpids, but we don't use the term for obvious reasons) into "durable" and "non-durable".
We want to limit the coupling between liveslots and the kernel, to give them each freedom to manage their own data structures without flag days or challenging upgrade steps, so we don't really want the kernel to know too much about how vrefs are structured. But I'm thinking that this durable/non-durable distinction might be worth bringing into the API.
Vat-exported vrefs start with o+ or p+ or d+ (for device nodes). Originally, the object vrefs were strictly o+NN, but with the introduction of virtual/durable data, that expanded into things like o+${kindID}/${instanceID}:${facetID} (e.g. o+11/2:0 is the first facet of the second instance of Kind o+11). The current void format is described in vatstore-usage.md. However the kernel doesn't know anything about the suffix: it will allocate krefs for any o+* that it sees, and the only special value is o+0 (which is automatically allocated for the root object during vat creation).
I'm thinking that we could expand that slightly, and introduce o+d* to mean "durable exported object". The kernel could then tell which c-list entries are for durable objects, and which are not. The kernel could then abandon all non-durables (except the o+0 root object) during upgrade, without needing to ask the vat about which ones are which.
(it might be cleaner to also introduce o+e* to mean "ephemeral exported object", and require all voids to start with either o+e or o+d, but I kind of like the pedagogical simplicity of talking about o+12 and o-34 without always needing to include additional characters)
This would allow the kernel to take responsibility for abandoning everything (promises and objects) that definitely goes away during the upgrade. It wouldn't immediately help with the GC-like release of durable objects which were only retained by now-deleted ephemeral/merely-virtual ones, but it would reveal more information to the kernel, that could help with some future process (e.g. if we also change vatstore to include kernel-visible slots, making the reference graph visible to a kernel-side mark/sweep operation).
If we did the same for Promise vrefs (vpids), i.e. p+dNN, then we'd be prepared for durable promises, which the kernel would not reject during upgrade (because the vat retains some durable mechanism to resolve/reject them in the new version).
Task List
- startVat argument with rejected promises and rejection object capdata to support durable promise watchers looking at own promises
- voids) with durable/non-durable status: annotate durable vrefs like o+dNN #6695
- stopVat #6696
- processVatUpgrade sequence: add BOYD to upgradeVat sequence #7001
- dispatch.stopVat, the kernel's call to it, and the rest of src/liveslots/stop-vat.js (including the commented-out code that performs object-graph chasing, and maybe the test code in test-upgrade.js which attempted to verify it) (this ticket)
Security Considerations
Doing less work during upgrade should reduce the risk of an upgrade going wrong, which improves security (or at least improves our ability to react to a security problem).
Removing stopVat reduces our reliance on vat code (liveslots, not userspace) slightly: a compromised liveslots could no longer avoid having non-durable promises be rejected during upgrade, nor could it prevent non-durable exports from being abandoned.
Compatibility Considerations
Changes to the vref format (o+d*) require coordination between vat (liveslots) and kernel, as does changing or removing the stopVat portion of the upgrade protocol, and changing which side is responsible for promise rejection and object abandonment. If we choose to implement this, but aren't able to deploy it before the bulldozer upgrade, then we'd need some sort of per-vat flag to tell the kernel what to do during an upgrade ("v1 vats should get a stopVat, >=v2 vats should have their c-lists walked and things rejected/abandoned").
Test Plan
Lots of unit tests.
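As one example of the shape such a test could take, here is a table-driven check of a classifier for the proposed o+d* annotation. classifyExport is a hypothetical helper written for this sketch, not an existing kernel function, and the sample vref o+d11/2:0 assumes the d marker would simply precede the existing kind/instance/facet suffix.

```javascript
// Hypothetical classifier for the proposed vref annotation (not kernel code).
function classifyExport(vref) {
  if (vref === 'o+0') return 'root'; // root object, never abandoned
  if (vref.startsWith('o+d')) return 'durable'; // proposed o+d* prefix
  if (vref.startsWith('o+')) return 'non-durable';
  return 'not-an-exported-object'; // imports, promises, device nodes
}

// Each row pairs a vref with the classification the proposal implies.
const cases = [
  ['o+0', 'root'],
  ['o+12', 'non-durable'],
  ['o+11/2:0', 'non-durable'],
  ['o+d12', 'durable'],
  ['o+d11/2:0', 'durable'], // assumed layout, see the note above
  ['o-34', 'not-an-exported-object'],
  ['d+1', 'not-an-exported-object'],
];

for (const [vref, expected] of cases) {
  const actual = classifyExport(vref);
  if (actual !== expected) {
    throw new Error(`${vref}: expected ${expected}, got ${actual}`);
  }
}
console.log('all vref cases passed');
```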
cc @FUDCo @gibson042