What is the Problem Being Solved?
#6773, and state-sync in general, needs us to make a decision about how much of the vat transcript we retain. These transcript entries record each delivery to a vat (dispatch.deliver() for messages, dispatch.notify() for promise resolution, some others for GC events), as well as every syscall made by the vat during each delivery, plus the results of each syscall (especially ones like syscall.vatstoreGet).
The lifetime of a vat starts with createVat, which initializes its vatstore and begins execution with a V1 bundle. While this "incarnation"/version is active, we perform some (potentially large) number of deliveries. Every once in a while, currently every 2000 deliveries, we write out a heap snapshot. These incarnations are punctuated by upgradeVat events, where we retain the vatstore but discard the worker and start fresh from a new V2 bundle. Finally, the vat is terminated, and we can delete the worker, the transcript, and the vatstore.
This breaks up the transcript into a hierarchy of spans:
createVat
  incarnation 1 is running
    1.0: transcript from V1 startVat up until first heap snapshot is recorded
    1.1: transcript from first heap snapshot until second
    ..
    1.N: transcript from last heap snapshot until upgrade point
  incarnation 2 is running
    2.0: transcript from V2 startVat until first heap snapshot
    ..
    2.N:
terminateVat
(Note: we should also decide about 0- or 1-indexed incarnation numbers. Also, we haven't clearly broken up our transcript into separate incarnations, but we probably should.)
For our current functionality, we only need the entries since the most recent heap snapshot was recorded (the 1:N span). We use these to bring a worker online (e.g. when the host is rebooted, and we need new workers for all active vats). We load the worker from the heap snapshot, then replay all the deliveries since the snapshot was taken. We expect it to perform the same syscalls as the first time around, and we use the syscall results from the transcript to provide responses to the worker's requests.
We've been retaining the ability to perform a "replay-style" repair/upgrade of a vat (#1691), in which we don't use a heap snapshot, and start from the beginning of the transcript. We can imagine a shorter form of replay, where we start at the most recent upgrade point (the latest incarnation), or a longer form where we start at the very very beginning (incarnation 1). We don't have any code or API to trigger this sort of upgrade, but we figured it was better to spend some extra disk/DB space if it buys us the ability to implement this in the future. We might find ourselves in some emergency situation where this sort of upgrade was the only way to fix a problem in the deployed system, and the disk space seemed pretty trivial. For reference, our pismoA chain (which, to be fair, experiences a fairly low swingset traffic rate) is currently growing the streamStore at a rate of 57MB/day, and this represents probably 98% of the swingset-state growth. It is tiny compared to how much cosmos-sdk is growing (850MB/day), but if they can fix their pruning bugs, we'd like swingset to not be the dominant source of space consumption.
However, now that we're looking at state-sync, the total amount of data delivered to new validators is important, as it both adds to the expense of running one and adds to the time it takes to bring up a new one.
Regardless of what we decide about retention, our plan is:
- break up the transcripts into (vatID, incarnation number, endPos) spans, as described above
- for each vat, there will be an "active span", to which we're appending new entries, and zero or more "old spans", which are now immutable
- each span has a hash, which is built cumulatively (hash_0 = sha256(entry_0); hash_1 = sha256(hash_0 + sha256(entry_1)); hash_2 = sha256(hash_1 + sha256(entry_2)); .. until we get hash_N at the end); see the sketch after this list
- of course each entry must be serialized in a deterministic way (we currently use djson for comparison, but plain JSON for storage)
- these must be validated a whole span at a time, but realistically that's also the unit of execution or analysis, so no big deal
- swing-store has an API to retrieve an iterator of entries for a given (vatID, incarnationNum, endPos) tuple, both for old spans and the active span
- swing-store also has an API to report back the current/highest (incarnationNumber, endPos) for a given vat, so vat-warehouse can learn which is the active span
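For concreteness, here is a minimal sketch (Node-style JavaScript, not the actual swing-store code) of the cumulative span hash described above, assuming entries arrive as already-serialized strings:

```js
// Sketch only: cumulative span hash over deterministically-serialized entries.
// hash_0 = sha256(entry_0); hash_i = sha256(hash_{i-1} + sha256(entry_i))
import { createHash } from 'crypto';

const sha256 = data => createHash('sha256').update(data).digest();

function computeSpanHash(entries) {
  // `entries` is assumed to be a non-empty array of serialized entry strings
  let hash = sha256(entries[0]);
  for (const entry of entries.slice(1)) {
    hash = sha256(Buffer.concat([hash, sha256(entry)]));
  }
  return hash.toString('hex');
}
```

A validator handed alleged span entries (e.g. from an archive node) would recompute this and refuse to proceed unless it matches the hash recorded in its own swing-store.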
If we then take the reduced-retention path we're considering, that means:
- state-sync export includes an artifact blob for each vat's current span, but not for any old spans
- swing-store deletes the old spans as soon as a heap snapshot is taken (i.e. as soon as the snapStore entry for a given vatID is overwritten by a new snapshot/endPos pair)
  - unless that swing-store was put into an archival mode, in which case the spans are retained indefinitely
  - it might be nice to add a flag column on that table, so a simple external sqlite3 swingstore.sqlite 'DELETE FROM transcript_spans WHERE active=0' could delete them all (see the sketch after this list)
- DB usage for the transcripts (streamStore) is relatively constant, or at least O(num_vats) rather than O(num_vats * num_deliveries)
  - as is the size of the state-sync data, and time to bring up a new validator
  - not quite constant, because we're adding a single hash for each old span, but those are pretty tiny
- if we ever find ourselves needing to perform a replay-style upgrade:
  - every validator would need to contact some sort of archive node
  - they would retrieve all the necessary spans (perhaps just the current incarnation, or perhaps everything)
  - they'd recompute the per-span hash from the alleged spans they just got, compare it against the local hash in their swing-store, and not proceed unless they match
  - this would incur an availability risk: each validator that fails to acquire the necessary data in time would not be able to perform the replay upgrade, and would then fall out of consensus
  - at a minimum, we would want to run an archival node ourselves, and arrange to push the span blobs into S3 or some CDN for easy distribution in the future
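As an illustration of that flag-column idea, pruning could be as simple as the following (the transcript_spans table and active column are the hypothetical ones from the example above, not necessarily the real swing-store schema):

```js
// Sketch only: prune old (non-active) transcript spans from a swing-store DB.
// Assumes a hypothetical transcript_spans table with an `active` flag column.
import Database from 'better-sqlite3';

const db = new Database('swingstore.sqlite');
const { changes } = db
  .prepare('DELETE FROM transcript_spans WHERE active=0')
  .run();
console.log(`pruned ${changes} old span(s)`);
```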
We might consider changing the format of old spans to reduce the cost of validation (in all cases), and/or the space they consume (for archival nodes, or if we do not choose to do this pruning); see the sketch after this list:
- rewrite the list of entries into a single large blob, perhaps one-entry-per-line with djson-formatted entries
- hash the blob, forming the span hash
- compress the blob
- store (vatID, endPos, compressedBlob, spanHash) into a separate table of old spans
- note that the hashing/validation procedure is very different for old vs new spans
  - validation is slightly faster: decompress and stream through a hasher, rather than needing to parse out each entry and do a cumulative hash
- remove the individual entries from the "current span entries" table
- change swingStore to provide access to the old spans with an API that knows how to decompress and split the old blobs into an iterator of entries
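A minimal sketch of that compacted old-span format, assuming one djson-serialized entry per line (the helper names are illustrative, not the actual swingStore API):

```js
// Sketch only: compact an old span into (compressedBlob, spanHash), plus the
// corresponding validation and read paths. Assumes each serialized entry is a
// single line of text with no embedded newlines.
import { createHash } from 'crypto';
import { gzipSync, gunzipSync } from 'zlib';

const sha256hex = data => createHash('sha256').update(data).digest('hex');

function compactSpan(entries) {
  const blob = entries.join('\n');      // one entry per line
  const spanHash = sha256hex(blob);     // flat hash over the whole blob
  const compressedBlob = gzipSync(blob);
  return { compressedBlob, spanHash };  // stored alongside (vatID, endPos)
}

// validation: decompress and stream the blob through a hasher
function validateSpan({ compressedBlob, spanHash }) {
  return sha256hex(gunzipSync(compressedBlob)) === spanHash;
}

// reading an old span back as an iterator of entries
function* readSpanEntries(compressedBlob) {
  yield* gunzipSync(compressedBlob).toString().split('\n');
}
```

Note that, as described above, validating an old span this way is a different procedure from the cumulative hash used for the active span.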
Back when I worked on Tahoe-LAFS, we used "alacrity" to refer to the amount of alleged data you needed to fetch (and hash) to obtain a small amount of validated data: Merkle trees provide O(logN) alacrity, flat hashes provide O(N). You want to design your hash structure to match your use cases.
For SwingSet, the primary use case is to bring up all workers from the most recent snapshot (a new validator doesn't know which vats are going to be active, so they should be prepared to spin up all of them, which means fetching data for all vats). Some secondary use cases are to replay a single vat from the latest snapshot (for debugging purposes), followed by replaying the whole latest incarnation of a single vat (either debugging or a replay-style upgrade), followed by replaying all incarnations of a single vat, followed by doing something with all vats. Since anyone can query any (untrusted) RPC node for the contents of any IAVL key (and a Merkle proof that traces up to a publicly-verifiable block root), or maybe even a range of clustered keys, we don't need to be particularly clever about how we store the hashes. If we're ok with the secondary use cases needing to fetch O(N) blobs, then we can use keys like ${vatID}.${incarnationNum}.${endPos}, and most of the use cases will be querying for a contiguous range of keys, which won't exactly line up with the IAVL Merkle tree but should still be pretty good.
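To make that key scheme concrete, here is a tiny illustration (the key layout and values are assumptions for illustration, not the actual IAVL/vstorage layout):

```js
// Illustration only: per-span hash keys laid out so that fetching all span
// hashes for one vat is a query over a contiguous, prefix-shared key range.
const spanKey = (vatID, incarnationNum, endPos) =>
  `${vatID}.${incarnationNum}.${endPos}`;

// e.g. spanKey('v7', 1, 4000) === 'v7.1.4000'
// all of v7's spans share the 'v7.' prefix, so they sort together
```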
Decisions To Make
1: are the space- and state-sync-download-time savings worth the availability risk we're adding to a sleeper-agent/replay-style upgrade scenario?
2: if so, how should we ensure availability of that data? run an archive node, hire people to run them, both, store in IPFS, S3, etc
Description of the Design
Security Considerations
As with #6773, the primary concern is that transcripts are correctly validated against a chain-verified hash (the integrity requirement). A secondary concern is the availability of transcript data.
- add a config knob to control pruning/deletion of old transcript spans
  - similar to Cosmos's pruning-keep-recent= and min-retain-blocks= in app.toml
  - default to retention
- advertise the deletion option to validators (as part of a "here are some ways to save space" document)
- provide a tool to delete the old ones without the knob (basically just a sqlite3 command)
- build some tooling to extract old spans and publish them to some highly-available storage platform (S3, IPFS)
  - encourage archive nodes to participate in that process
  - make it easy for anybody to e.g. ipfs pin the new data
- future hypothetical replay-style upgrades will either require that the validator did not exercise the "save space" option, or they must first fetch the blobs from the published copies
  - if we were only upgrading a single vat, we could conceivably include these blobs in the upgraded software tarball
  - or just provide instructions on how to download them as part of the upgrade instructions
  - we should make it as simple as dropping a certain set of files in a specific directory before relaunching the node