-
Notifications
You must be signed in to change notification settings - Fork 215
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(swingset): get a small upgrade to work #4927
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -567,6 +567,27 @@ export function makeVatKeeper( | |
return true; | ||
} | ||
|
||
function removeSnapshotAndTranscript() { | ||
const skey = `local.${vatID}.lastSnapshot`; | ||
const epkey = `${vatID}.t.endPosition`; | ||
if (snapStore) { | ||
const notation = kvStore.get(skey); | ||
if (notation) { | ||
const { snapshotID } = JSON.parse(notation); | ||
if (removeFromSnapshot(snapshotID) === 0) { | ||
// TODO: if we roll back (because the upgrade failed), we must | ||
// not really delete the snapshot | ||
snapStore.prepareToDelete(snapshotID); | ||
} | ||
kvStore.delete(skey); | ||
} | ||
} | ||
// TODO: same rollback concern | ||
// TODO: streamStore.deleteStream(transcriptStream); | ||
const newStart = streamStore.STREAM_START; | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do we really want to reset the stream pointer to the start position or merely remove the prior transcript entries? Seems like for debugging purposes it would be nicer to maintain the continuity of transcript position across the upgrade, so that, for example, if one was looking at a historical record of a transcript there would be no ambiguity about which incarnation of the vat code it was associated with. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah, I was wondering about that too. The conclusion I came to was that the transcript is intimately tied to the bundleID being executed: it doesn't make sense to retain a transcript without also retaining the bundle. It's not like you can replay the concatenated transcripts on top of only the original source bundle and get to the current state. To rebuild the vat from nothing requires a data structure that looks like "load bundle1, execute these N1 transcript entries, then shutdown and load bundle2, then execute these other N2 entries, etc". To do that properly we'd need to put the bundleID in the That's not a bad design, but I haven't stopped to think about how long it would take to make the necessary changes to achieve it. I'll spend a bit evaluating that before I land this PR. At the very least I'll sketch out what the complete design would be so I can compare. Performance-wise, I was hoping to delete as much data as possible, and being able to delete the transcript and the old bundles is appealing. A new validator who's trying to catch up doesn't need to replay the transcripts from bundle-1: it's sufficient for them to replay from just the most recent There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Deleting the old transcript sounds fine to me. I was just thinking that having continuity of the crank numbers in the historical timeline would be helpful when debugging timelines that cross an update boundary. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It looks like it's non-trivial to make this change now. We could change I'll open a new ticket to see if we can come back to this later. There's an open policy question as to whether we should retain the historical transcripts from the very beginning of time, or just enough to enable #1691 -style replay (which, to be clear, cannot benefit from replay beyond the most recent version, which is a significant downside of the null-upgrade step). There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I created #4940 to track this idea. |
||
kvStore.set(epkey, `${JSON.stringify(newStart)}`); | ||
} | ||
|
||
function vatStats() { | ||
function getCount(key, first) { | ||
const id = Nat(BigInt(getRequired(key))); | ||
|
@@ -638,5 +659,6 @@ export function makeVatKeeper( | |
saveSnapshot, | ||
getLastSnapshot, | ||
removeFromSnapshot, | ||
removeSnapshotAndTranscript, | ||
}); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little nervous about doing two
deliverAndLogToVat
calls within a single crank. While there's nothing specifically I can point at that seems wrong per se, up until now cranks and deliveries have been in 1:1 correspondence. In particular, we should be extra careful that there's nothing indeliverAndLogToVat
that bakes in assumptions that it is at the start of a crank on the way in or at the end of a crank on the way out.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, this is the first time we're taking advantage of the difference between "crank" and "delivery". I think
vatWarehouse.deliverToVat
and the slogger is entirely ready for it, but the part that I know needs some polish is on the delivery results. The only sort of errors we've dealt with so far are all the kinds that terminate the vat. We're now entering into territory where e.g.startVat
will soon be able to notice that userspace failed to reconnect all of the durable Kinds, and if that happens we want to roll back the entire upgrade. We'll need a response that says "hey caller, the worker is wedged and no longer viable, but if youabortCrank
and jettison the worker, you could resume from the earlier snapshot without problems". My plan is to get roll-back-failed-upgrade working after getting the successful upgrade paths done, at which point I'll be looking more closely at these delivery results and their error modes.