
decide about retention of old transcript entries #6775

Closed
warner opened this issue Jan 11, 2023 · 4 comments
Labels: enhancement (New feature or request), SwingSet (package: SwingSet), vaults_triage (DO NOT USE)

warner commented Jan 11, 2023

What is the Problem Being Solved?

#6773, and state-sync in general, needs us to make a decision about how much of the vat transcript we retain. These transcript entries record each delivery to a vat (dispatch.deliver() for messages, dispatch.notify() for promise resolution, some others for GC events), as well as every syscall made by the vat during each delivery, plus the results of each syscall (especially ones like syscall.vatstoreGet).
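For illustration only, a single transcript entry might record something like the following. This is a hypothetical shape with made-up field names; the actual SwingSet serialization differs in its details.

```js
// Hypothetical shape of one transcript entry, for illustration only; the
// real SwingSet on-disk format differs in its details.
const entry = {
  // the delivery made into the vat (a message, a promise notify, or a GC event)
  d: ['message', 'o+12', { method: 'wake', args: [] }],
  // every syscall the vat made during that delivery, plus each syscall's result
  syscalls: [
    { d: ['vatstoreGet', 'counter'], response: ['ok', '42'] },
    { d: ['vatstoreSet', 'counter', '43'], response: ['ok'] },
  ],
};
```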

The lifetime of a vat starts with createVat, which initializes its vatstore and begins execution with a V1 bundle. While this "incarnation"/version is active, we perform some (potentially large) number of deliveries. Every once in a while, currently every 2000 deliveries, we write out a heap snapshot. These incarnations are punctuated by upgradeVat events, where we retain the vatstore but discard the worker and start fresh from a new V2 bundle. Finally, the vat is terminated, and we can delete the worker, the transcript, and the vatstore.

This breaks up the transcript into a hierarchy of spans:

  • createVat
  • incarnation 1 is running
    • 1.0: transcript from V1 startVat up until first heap snapshot is recorded
    • 1.1: transcript from first heap snapshot until second
    • ..
    • 1.N: transcript from last heap snapshot until upgrade point
  • incarnation 2 is running
    • 2.0: transcript from V2 startVat until first heap snapshot
    • ..
    • 2.N:
  • terminateVat

(Note: we should also decide between 0- and 1-indexed incarnation numbers. Also, we haven't yet clearly broken the transcript up into separate incarnations, but we probably should.)
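To make the hierarchy concrete, here is a hypothetical record-level view of the spans for one vat. The field names and position numbers are made up for illustration; this is not the actual swing-store schema.

```js
// Illustrative only: one record per span for a single vat 'v7', with
// made-up field names. "1.0" means incarnation 1, span 0, and so on.
const spansForV7 = [
  { vatID: 'v7', incarnation: 1, span: 0, startPos: 0,    endPos: 2000, isCurrent: false }, // 1.0: startVat .. first snapshot
  { vatID: 'v7', incarnation: 1, span: 1, startPos: 2000, endPos: 4000, isCurrent: false }, // 1.1: first snapshot .. second
  { vatID: 'v7', incarnation: 1, span: 2, startPos: 4000, endPos: 4700, isCurrent: false }, // 1.2: last snapshot .. upgrade
  { vatID: 'v7', incarnation: 2, span: 0, startPos: 4700, endPos: 5300, isCurrent: true  }, // 2.0: the active span
];
```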

For our current functionality, we only need the entries recorded since the most recent heap snapshot (the 1.N span). We use these to bring a worker online (e.g. when the host is rebooted and we need new workers for all active vats). We load the worker from the heap snapshot, then replay all the deliveries made since the snapshot was taken. We expect it to perform the same syscalls as the first time around, and we use the syscall results from the transcript to provide responses to the worker's requests.
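A minimal sketch of that replay flow. The helper names loadWorkerFromSnapshot() and getEntriesSinceSnapshot() are hypothetical stand-ins for the real vat-warehouse and swing-store APIs, not their actual signatures.

```js
import assert from 'assert';

// Hypothetical replay loop; loadWorkerFromSnapshot() and
// getEntriesSinceSnapshot() are made-up names, not the real APIs.
async function bringWorkerOnline(vatID) {
  const worker = await loadWorkerFromSnapshot(vatID); // most recent heap snapshot
  for await (const entry of getEntriesSinceSnapshot(vatID)) {
    // Answer the worker's syscalls from the transcript, not the live DB,
    // and insist that it makes exactly the syscalls it made the first time.
    const handleSyscall = vso => {
      const expected = entry.syscalls.shift();
      assert.deepStrictEqual(vso, expected.d, 'replay divergence');
      return expected.response;
    };
    await worker.deliver(entry.d, handleSyscall);
  }
  return worker;
}
```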

We've been retaining the ability to perform a "replay-style" repair/upgrade of a vat (#1691), in which we don't use a heap snapshot, and instead start from the beginning of the transcript. We can imagine a shorter form of replay, where we start at the most recent upgrade point (the latest incarnation), or a longer form where we start at the very beginning (incarnation 1). We don't have any code or API to trigger this sort of upgrade, but we figured it was better to spend some extra disk/DB space if it buys us the ability to implement this in the future. We might find ourselves in some emergency situation where this sort of upgrade is the only way to fix a problem in the deployed system, and the disk space seemed pretty trivial. For reference, our pismoA chain (which, to be fair, experiences a fairly low swingset traffic rate) is currently growing the streamStore at a rate of 57MB/day, and this represents probably 98% of the swingset-state growth. That is tiny compared to how much cosmos-sdk is growing (850MB/day), but if they can fix their pruning bugs, we'd like swingset not to be the dominant source of space consumption.

However, now that we're looking at state-sync, the total amount of data delivered to new validators matters: it adds both to the expense of running a validator and to the time it takes to bring a new one up.

Regardless of what we decide about retention, our plan is:

  • break up the transcripts into (vatID, incarnation number, endPos) spans, as described above
    • for each vat, there will be an "active span", to which we're appending new entries, and zero or more "old spans", which are now immutable
    • each span has a hash, which is built cumulatively (hash_0 = sha256(entry_0); hash_1 = sha256(hash_0 + sha256(entry_1)); hash_2 = sha256(hash_1 + sha256(entry_2)); .. until we get hash_N at the end); a sketch follows this list
      • of course each entry must be serialized in a deterministic way (we currently use djson for comparison, but plain JSON for storage)
    • these must be validated a whole span at a time, but realistically that's also the unit of execution or analysis, so no big deal
  • swing-store has an API to retrieve an iterator of entries for a given (vatID, incarnationNum, endPos) tuple, both for old spans and the active span
  • swing-store also has an API to report back the current/highest (incarnationNumber, endPos) for a given vat, so vat-warehouse can learn which is the active span
  • the activityHash and consensus data (the #6773 export data) remember the span-hash for all spans, both current and old
  • each span can be validated against those hashes
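As referenced above, here is a sketch of that cumulative span hash using Node's crypto module. Plain JSON.stringify stands in for the deterministic serialization the list calls for; it is only an illustration, not the swing-store implementation.

```js
import { createHash } from 'crypto';

const sha256 = data => createHash('sha256').update(data).digest();

// Cumulative span hash as described in the list above:
//   hash_0 = sha256(entry_0)
//   hash_i = sha256(hash_{i-1} + sha256(entry_i))
// JSON.stringify stands in for a deterministic serialization.
function computeSpanHash(entries) {
  let hash;
  for (const entry of entries) {
    const serialized = JSON.stringify(entry);
    hash = hash === undefined
      ? sha256(serialized)
      : sha256(Buffer.concat([hash, sha256(serialized)]));
  }
  return hash && hash.toString('hex');
}
```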

If we then take the reduced-retention path we're considering, that means:

  • state-sync export includes an artifact blob for each vat's current span, but not for any old spans
  • swing-store deletes the old spans as soon as a heap snapshot is taken (i.e. as soon as the snapStore entry for a given vatID is overwritten by a new snapshot/endPos pair)
    • unless that swing-store was put into an archival mode, in which case the spans are retained indefinitely
    • it might be nice to add a flag column on that table, so a simple external sqlite3 swingstore.sqlite 'DELETE FROM transcript_spans WHERE active=0' could delete them all (see the sketch after this list)
  • DB usage for the transcripts (streamStore) is relatively constant, or at least O(num_vats) rather than O(num_vats * num_deliveries)
    • as is the size of the state-sync data, and time to bring up a new validator
    • not quite constant, because we're adding a single hash for each old span, but those are pretty tiny
  • if we ever find ourselves needing to perform a replay-style upgrade:
    • every validator would need to contact some sort of archive node
    • they would retrieve all the necessary spans (perhaps just the current incarnation, or perhaps everything)
    • they'd recompute the per-span hash from the alleged spans they just got, and compare it against the local hash in their swing-store, and not proceed unless they match
    • this would incur an availability risk: each validator that fails to acquire the necessary data in time would not be able to perform the replay upgrade, and would then fall out of consensus
    • at a minimum, we would want to run an archival node ourselves, and arrange to push the span blobs into S3 or some CDN for easy distribution in the future
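As mentioned in the list, a sketch of what the deletion could look like with better-sqlite3. The table and column names here are assumptions for illustration, since the final schema is not settled by this issue.

```js
import Database from 'better-sqlite3';

const db = new Database('swingstore.sqlite');

// Hypothetical pruning step, run when a new heap snapshot replaces the old
// snapStore entry for a vat. Table and column names are illustrative only.
function pruneOldSpans(vatID, { archival = false } = {}) {
  if (archival) return; // archival-mode swing-stores retain everything
  db.prepare(
    'DELETE FROM transcript_entries WHERE vatID = ? AND spanIsCurrent = 0',
  ).run(vatID);
}

// The external one-shot cleanup mentioned above would be the same idea:
//   sqlite3 swingstore.sqlite 'DELETE FROM transcript_spans WHERE active=0'
```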

We might consider changing the format of old spans to reduce the cost of validation (in all cases) and/or the space they consume (for archival nodes, or if we choose not to do this pruning); a sketch follows the list:

  • rewrite the list of entries into a single large blob, perhaps one-entry-per-line with djson-formatted entries
  • hash the blob, forming the span hash
  • compress the blob
  • store (vatID, endPos, compressedBlob, spanHash) into a separate table of old spans
    • note that the hashing/validation procedure is very different for old vs new spans
    • validation is slightly faster: decompress and stream through a hasher, rather than needing to parse out each entry and do a cumulative hash
  • remove the individual entries from the "current span entries" table
  • change swingStore to provide access to the old spans with an API that knows how to decompress and split the old blobs into an iterator of entries
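That rewrite might look roughly like this, using Node's zlib and crypto. JSON.stringify again stands in for the djson-style serialization, and the step that stores the result into the old-spans table is omitted.

```js
import { createHash } from 'crypto';
import { gzipSync, gunzipSync } from 'zlib';

// Rewrite a finished span into (spanHash, compressedBlob): one entry per
// line, hash the whole blob, then compress it. JSON.stringify stands in
// for the djson-style deterministic serialization mentioned above.
function archiveSpan(entries) {
  const blob = entries.map(entry => JSON.stringify(entry)).join('\n');
  const spanHash = createHash('sha256').update(blob).digest('hex');
  const compressedBlob = gzipSync(blob);
  return { spanHash, compressedBlob };
}

// Reading an old span back: decompress, then split the blob into entries.
// Validation is the same path: decompress, hash the blob, compare hashes.
function* readArchivedSpan(compressedBlob) {
  const lines = gunzipSync(compressedBlob).toString().split('\n');
  for (const line of lines) {
    if (line !== '') yield JSON.parse(line);
  }
}
```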

Back when I worked on Tahoe-LAFS, we used "alacrity" to refer to the amount of alleged data you needed to fetch (and hash) to obtain a small amount of validated data: Merkle trees provide O(logN) alacrity, flat hashes provide O(N). You want to design your hash structure to match your use cases.

For SwingSet, the primary use case is to bring up all workers from the most recent snapshot (a new validator doesn't know which vats are going to be active, so it should be prepared to spin up all of them, which means fetching data for all vats). Some secondary use cases are to replay a single vat from the latest snapshot (for debugging purposes), followed by replaying the whole latest incarnation of a single vat (either debugging or a replay-style upgrade), followed by replaying all incarnations of a single vat, followed by doing something with all vats.

Since anyone can query any (untrusted) RPC node for the contents of any IAVL key (and a Merkle proof that traces up to a publicly-verifiable block root), or maybe even a range of clustered keys, we don't need to be particularly clever about how we store the hashes. If we're ok with the secondary use cases needing to fetch O(N) blobs, then we can use keys like ${vatID}.${incarnationNum}.${endPos}, and most of the use cases will be querying for a contiguous range of keys, which won't exactly line up with the IAVL Merkle tree but should still be pretty good.
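For concreteness, the key scheme from the last sentence might look like this; the export-data plumbing around it is hypothetical.

```js
// Key scheme from the paragraph above; the export-data plumbing around it
// is hypothetical.
const spanExportKey = (vatID, incarnationNum, endPos) =>
  `${vatID}.${incarnationNum}.${endPos}`;

// e.g. spanExportKey('v7', 1, 2000) === 'v7.1.2000', so all spans of one
// vat (or of one incarnation) fall into a contiguous key range that an
// RPC node can serve along with IAVL Merkle proofs.
```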

Decisions To Make

  • 1: are the space and state-sync-download-time savings worth the availability risk we're adding to a sleeper-agent/replay-style upgrade scenario?
  • 2: if so, how should we ensure the availability of that data? run an archive node ourselves, hire people to run them, both, store the blobs in IPFS or S3, etc.

Description of the Design

Security Considerations

As with #6773, the primary concern is that transcripts are correctly validated against a chain-verified hash (the integrity requirement). A secondary concern is the availability of transcript data.

@warner warner added enhancement New feature or request SwingSet package: SwingSet labels Jan 11, 2023
@otoole-brendan otoole-brendan added the vaults_triage DO NOT USE label Jan 11, 2023

warner commented Jan 11, 2023

related to #6702

@arirubinstein (Contributor) commented:

cc @raphdev


warner commented Jan 13, 2023

In yesterday's kernel meeting, we concluded:

  • add a config knob to control pruning/deletion of old transcript spans (a hypothetical shape is sketched after this list)
    • similar to Cosmos's pruning-keep-recent= and min-retain-blocks= in app.toml
    • default to retention
    • advertise the deletion option to validators (as part of a "here are some ways to save space" document)
    • provide a tool to delete the old ones without the knob (basically just a sqlite3 command)
  • build some tooling to extract old spans and publish them to some highly-available storage platform (S3, IPFS)
    • encourage archive nodes to participate in that process
    • make it easy for anybody to e.g. ipfs pin the new data
  • future hypothetical replay-style upgrades will either require that the validator did not exercise the "save space" option, or they must first fetch the blobs from the published copies
    • if we were only upgrading a single vat, we could conceivably include these blobs in the upgraded software tarball
    • or just provide instructions on how to download them as part of the upgrade instructions
    • we should make it as simple as dropping a certain set of files in a specific directory before relaunching the node
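A hypothetical shape for that config knob; the actual option name, config location, and wiring were not settled in the meeting beyond "default to retention".

```js
// Hypothetical retention knob, loosely modeled on Cosmos's pruning options;
// the real name and config location are not decided by this issue.
const swingsetConfig = {
  transcriptRetention: {
    keepOldSpans: true, // default: retain all old transcript spans
  },
};

// A space-constrained validator could set keepOldSpans: false, or run the
// one-shot sqlite3 DELETE mentioned earlier to reclaim space after the fact.
```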


warner commented Jan 30, 2023

decision made, closing ticket
