What is the Problem Being Solved?
#6773, and state-sync in general, needs us to make a decision about how much of the vat transcript we retain. These transcript entries record each delivery to a vat (dispatch.deliver() for messages, dispatch.notify() for promise resolution, some others for GC events), as well as every syscall made by the vat during each delivery, plus the results of each syscall (especially ones like syscall.vatstoreGet).
The lifetime of a vat starts with createVat, which initializes its vatstore and begins execution with a V1 bundle. While this "incarnation"/version is active, we perform some (potentially large) number of deliveries. Every once in a while, currently every 2000 deliveries, we write out a heap snapshot. These incarnations are punctuated by upgradeVat events, where we retain the vatstore but discard the worker and start fresh from a new V2 bundle. Finally, the vat is terminated, and we can delete the worker, the transcript, and the vatstore.
This breaks up the transcript into a hierarchy of spans:
createVat
  incarnation 1 is running
    1.0: transcript from V1 startVat up until first heap snapshot is recorded
    1.1: transcript from first heap snapshot until second
    ..
    1.N: transcript from last heap snapshot until upgrade point
  incarnation 2 is running
    2.0: transcript from V2 startVat until first heap snapshot
    ..
    2.N:
terminateVat
(Note: we should also decide about 0- or 1-indexed incarnation numbers. Also, we haven't clearly broken up our transcript into separate incarnations, but we probably should.)
For our current functionality, we only need the entries since the most recent heap snapshot was recorded (the 1:N span). We use these to bring a worker online (e.g. when the host is rebooted, and we need new workers for all active vats). We load the worker from the heap snapshot, then replay all the deliveries since the snapshot was taken. We expect it to perform the same syscalls as the first time around, and we use the syscall results from the transcript to provide responses to the worker's requests.
We've been retaining the ability to perform a "replay-style" repair/upgrade of a vat (#1691), in which we don't use a heap snapshot, and start from the beginning of the transcript. We can imagine a shorter form of replay, where we start at the most recent upgrade point (the latest incarnation), or a longer form where we start at the very very beginning (incarnation 1). We don't have any code or API to trigger this sort of upgrade, but we figured it was better to spend some extra disk/DB space if it buys us the ability to implement this in the future. We might find ourselves in some emergency situation where this sort of upgrade was the only way to fix a problem in the deployed system, and the disk space seemed pretty trivial. For reference, our pismoA chain (which, to be fair, experiences a fairly low swingset traffic rate) is currently growing the streamStore at a rate of 57MB/day, and this represents probably 98% of the swingset-state growth. It is tiny compared to how much cosmos-sdk is growing (850MB/day), but if they can fix their pruning bugs, we'd like swingset to not be the dominant source of space consumption.
However, now that we're looking at state-sync, the total amount of data delivered to new validators is important, as it both adds to the expense of running one and adds to the time it takes to bring up a new one.
Regardless of what we decide about retention, our plan is:
- break up the transcripts into (vatID, incarnation number, endPos) spans, as described above
- for each vat, there will be an "active span", to which we're appending new entries, and zero or more "old spans", which are now immutable
- each span has a hash, which is built cumulatively (hash_0 = sha256(entry_0); hash_1 = sha256(hash_0 + sha256(entry_1)); hash_2 = sha256(hash_1 + sha256(entry_2)); .. until we get hash_N at the end); see the sketch after this list
- of course each entry must be serialized in a deterministic way (we currently use djson for comparison, but plain JSON for storage)
- these must be validated a whole span at a time, but realistically that's also the unit of execution or analysis, so no big deal
- swing-store has an API to retrieve an iterator of entries for a given (vatID, incarnationNum, endPos) tuple, both for old spans and the active span
- swing-store also has an API to report back the current/highest (incarnationNumber, endPos) for a given vat, so vat-warehouse can learn which is the active span
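For concreteness, here is a minimal sketch (Node-style JavaScript, not the actual swing-store code) of the cumulative span hash described above, assuming entries arrive as already-serialized strings:

```js
// Sketch only: cumulative span hash over deterministically-serialized entries.
// hash_0 = sha256(entry_0); hash_i = sha256(hash_{i-1} + sha256(entry_i))
import { createHash } from 'crypto';

const sha256 = data => createHash('sha256').update(data).digest();

function computeSpanHash(entries) {
  // `entries` is assumed to be a non-empty array of serialized entry strings
  let hash = sha256(entries[0]);
  for (const entry of entries.slice(1)) {
    hash = sha256(Buffer.concat([hash, sha256(entry)]));
  }
  return hash.toString('hex');
}
```

A validator handed alleged span entries (e.g. from an archive node) would recompute this and refuse to proceed unless it matches the hash recorded in its own swing-store.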
If we then take the reduced-retention path we're considering, that means:
- state-sync export includes an artifact blob for each vat's current span, but not for any old spans
- swing-store deletes the old spans as soon as a heap snapshot is taken (i.e. as soon as the snapStore entry for a given vatID is overwritten by a new snapshot/endPos pair)
  - unless that swing-store was put into an archival mode, in which case the spans are retained indefinitely
  - it might be nice to add a flag column on that table, so a simple external sqlite3 swingstore.sqlite 'DELETE FROM transcript_spans WHERE active=0' could delete them all (see the sketch after this list)
- DB usage for the transcripts (streamStore) is relatively constant, or at least O(num_vats) rather than O(num_vats * num_deliveries)
  - as is the size of the state-sync data, and time to bring up a new validator
  - not quite constant, because we're adding a single hash for each old span, but those are pretty tiny
- if we ever find ourselves needing to perform a replay-style upgrade:
  - every validator would need to contact some sort of archive node
  - they would retrieve all the necessary spans (perhaps just the current incarnation, or perhaps everything)
  - they'd recompute the per-span hash from the alleged spans they just got, compare it against the local hash in their swing-store, and not proceed unless they match
  - this would incur an availability risk: each validator that fails to acquire the necessary data in time would not be able to perform the replay upgrade, and would then fall out of consensus
  - at a minimum, we would want to run an archival node ourselves, and arrange to push the span blobs into S3 or some CDN for easy distribution in the future
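As an illustration of that flag-column idea, pruning could be as simple as the following (the transcript_spans table and active column are the hypothetical ones from the example above, not necessarily the real swing-store schema):

```js
// Sketch only: prune old (non-active) transcript spans from a swing-store DB.
// Assumes a hypothetical transcript_spans table with an `active` flag column.
import Database from 'better-sqlite3';

const db = new Database('swingstore.sqlite');
const { changes } = db
  .prepare('DELETE FROM transcript_spans WHERE active=0')
  .run();
console.log(`pruned ${changes} old span(s)`);
```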
We might consider changing the format of old spans to reduce the cost of validation (in all cases), and/or the space they consume (for archival nodes, or if we do not choose to do this pruning); see the sketch after this list:
- rewrite the list of entries into a single large blob, perhaps one-entry-per-line with djson-formatted entries
- hash the blob, forming the span hash
- compress the blob
- store (vatID, endPos, compressedBlob, spanHash) into a separate table of old spans
- note that the hashing/validation procedure is very different for old vs new spans
  - validation is slightly faster: decompress and stream through a hasher, rather than needing to parse out each entry and do a cumulative hash
- remove the individual entries from the "current span entries" table
- change swingStore to provide access to the old spans with an API that knows how to decompress and split the old blobs into an iterator of entries
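A minimal sketch of that compacted old-span format, assuming one djson-serialized entry per line (the helper names are illustrative, not the actual swingStore API):

```js
// Sketch only: compact an old span into (compressedBlob, spanHash), plus the
// corresponding validation and read paths. Assumes each serialized entry is a
// single line of text with no embedded newlines.
import { createHash } from 'crypto';
import { gzipSync, gunzipSync } from 'zlib';

const sha256hex = data => createHash('sha256').update(data).digest('hex');

function compactSpan(entries) {
  const blob = entries.join('\n');      // one entry per line
  const spanHash = sha256hex(blob);     // flat hash over the whole blob
  const compressedBlob = gzipSync(blob);
  return { compressedBlob, spanHash };  // stored alongside (vatID, endPos)
}

// validation: decompress and stream the blob through a hasher
function validateSpan({ compressedBlob, spanHash }) {
  return sha256hex(gunzipSync(compressedBlob)) === spanHash;
}

// reading an old span back as an iterator of entries
function* readSpanEntries(compressedBlob) {
  yield* gunzipSync(compressedBlob).toString().split('\n');
}
```

Note that, as described above, validating an old span this way is a different procedure from the cumulative hash used for the active span.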
Back when I worked on Tahoe-LAFS, we used "alacrity" to refer to the amount of alleged data you needed to fetch (and hash) to obtain a small amount of validated data: Merkle trees provide O(logN) alacrity, flat hashes provide O(N). You want to design your hash structure to match your use cases.
For SwingSet, the primary use case is to bring up all workers from the most recent snapshot (a new validator doesn't know which vats are going to be active, so they should be prepared to spin up all of them, which means fetching data for all vats). Some secondary use cases are to replay a single vat from the latest snapshot (for debugging purposes), followed by replaying the whole latest incarnation of a single vat (either debugging or a replay-style upgrade), followed by replaying all incarnations of a single vat, followed by doing something with all vats. Since anyone can query any (untrusted) RPC node for the contents of any IAVL key (and a Merkle proof that traces up to a publicly-verifiable block root), or maybe even a range of clustered keys, we don't need to be particularly clever about how we store the hashes. If we're ok with the secondary use cases needing to fetch O(N) blobs, then we can use keys like ${vatID}.${incarnationNum}.${endPos}, and most of the use cases will be querying for a contiguous range of keys, which won't exactly line up with the IAVL Merkle tree but should still be pretty good.
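To make that key scheme concrete, here is a tiny illustration (the key layout and values are assumptions for illustration, not the actual IAVL/vstorage layout):

```js
// Illustration only: per-span hash keys laid out so that fetching all span
// hashes for one vat is a query over a contiguous, prefix-shared key range.
const spanKey = (vatID, incarnationNum, endPos) =>
  `${vatID}.${incarnationNum}.${endPos}`;

// e.g. spanKey('v7', 1, 4000) === 'v7.1.4000'
// all of v7's spans share the 'v7.' prefix, so they sort together
```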
Decisions To Make
1: are the space- and state-sync-download-time savings worth the availability risk we're adding to a sleeper-agent/replay-style upgrade scenario?
2: if so, how should we ensure availability of that data? run an archive node, hire people to run them, both, store in IPFS, S3, etc
Description of the Design
Security Considerations
As with #6773, the primary concern is that transcripts are correctly validated against a chain-verified hash (the integrity requirement). A secondary concern is the availability of transcript data.
- add a config knob to control pruning/deletion of old transcript spans
  - similar to Cosmos's pruning-keep-recent= and min-retain-blocks= in app.toml
  - default to retention
- advertise the deletion option to validators (as part of a "here are some ways to save space" document)
- provide a tool to delete the old ones without the knob (basically just a sqlite3 command)
- build some tooling to extract old spans and publish them to some highly-available storage platform (S3, IPFS)
  - encourage archive nodes to participate in that process
  - make it easy for anybody to e.g. ipfs pin the new data
- future hypothetical replay-style upgrades will either require that the validator did not exercise the "save space" option, or they must first fetch the blobs from the published copies
  - if we were only upgrading a single vat, we could conceivably include these blobs in the upgraded software tarball
  - or just provide instructions on how to download them as part of the upgrade instructions
  - we should make it as simple as dropping a certain set of files in a specific directory before relaunching the node