compress/archive old transcript entries in swingstore/streamstore #6702
Comments
I suppose there is overlap here with the rolling hash for state-sync "transcript segments" and the plan to keep these hashes around, even across upgrades (described in #5542 (comment)). However, I don't believe state-sync itself could help with fetching evicted transcript slabs/segments; we'd need an external mechanism, relying on the availability of these artifacts through archives. On the topic of compression, I recently wondered about that myself after noticing how well transcripts compress. A lot of that is due to the shared structure between entries, and normalizing transcript data in SQLite could go a long way toward reducing disk usage: for example, having a separate table for syscalls (indexed on vatId + upgradeNum + deliveryNum), an enum for syscall kind, etc.
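The normalization idea in the comment above could look something like the following sketch. Everything here (the `SYSCALL_KINDS` list, `normalizeSyscall`, the row shape) is hypothetical, not the actual swingstore schema; it only illustrates replacing the repeated syscall-kind strings with a small integer enum and keying rows on (vatId, upgradeNum, deliveryNum, syscallNum):

```javascript
// Hypothetical sketch: flatten transcript syscall entries into compact,
// normalized rows. None of these names are real swingstore APIs.
const SYSCALL_KINDS = [
  'send', 'callNow', 'subscribe', 'resolve',
  'vatstoreGet', 'vatstoreSet', 'vatstoreDelete',
];
const kindToEnum = new Map(SYSCALL_KINDS.map((k, i) => [k, i]));

// One row per syscall, keyed by (vatID, upgradeNum, deliveryNum, syscallNum)
function normalizeSyscall(vatID, upgradeNum, deliveryNum, syscallNum, syscall) {
  const kind = kindToEnum.get(syscall[0]);
  if (kind === undefined) {
    throw new Error(`unknown syscall kind ${syscall[0]}`);
  }
  return {
    vatID, upgradeNum, deliveryNum, syscallNum,
    kind, // small integer instead of a string repeated in every entry
    body: JSON.stringify(syscall.slice(1)), // kind-specific arguments only
  };
}

const row = normalizeSyscall('v1', 0, 42, 0, ['vatstoreGet', 'key1']);
console.log(row.kind, row.body); // → 4 ["key1"]
```

The enum and the shared key columns carry the structure that every entry currently repeats as JSON text, which is where much of the compressibility comes from.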
Per #6594 (comment) this is no longer the plan of record.
I'm fairly sure closing this is a mistake. Certainly it can be deprioritized, but we'll definitely need a way to do the stuff that's described here.
Yeah, I think I mentioned that the ability to archive would come almost for free after the state-sync implementation, but we'd still need something to do the archiving. I think we don't need to look into compression, for now at least.
Yeah, I agree that this shouldn't have been closed. We may or may not need to include the transcript hashes in the activity hash (which is what #6594 was about), but we still need to make it convenient to delete old transcript entries (and/or compress them) to keep our disk-side storage from growing without bound. I'm still of the opinion that we should move transcript entries out of the "active" table once a heap snapshot has happened, and into an "inactive" table (where the rows are compressed spans, plus metadata and a hash), because that makes enumerating/extracting the artifacts for state-sync pretty easy. But I'm willing to be talked into a different approach.
I'm assigning this to @FUDCo and @mhofman to figure out as part of the state-sync / swingstore work, but of course compression isn't nearly as important as state-sync, and I'm OK if compression doesn't happen in time for the vaults/bulldozer release. If it happens as a side effect, awesome; please close this when the feature is in place. If it doesn't, please leave it open with some notes here about how we should approach it at some future time when the priority rises to the top of the queue.
replaced by #8318 which has a more detailed plan, and a new urgency (the mainnet swingstore is up to about 21 GB) |
What is the Problem Being Solved?
@FUDCo and I were talking about an idea to reduce disk usage on the validators, by archiving (or completely removing) old transcript entries.
Strictly speaking, we only need vat transcript entries as far back as the most recent heap snapshot. We currently create those snapshots once every 2000 deliveries, so we never need to replay more than those 2000 to bring up a new worker (e.g. after we reboot the node, or if a previously-paged-out vat worker gets paged back in).
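As a rough sketch of that replay bound (assuming, for simplicity, that snapshots land exactly on multiples of the 2000-delivery interval; the real scheduling is more nuanced):

```javascript
// Sketch: how many deliveries must be replayed to bring up a worker, if a
// heap snapshot is written every `snapshotInterval` deliveries. Simplified
// model, not the actual swingset snapshot scheduler.
const snapshotInterval = 2000;

function deliveriesToReplay(deliveryNum) {
  // the most recent snapshot point at or before this delivery
  const lastSnapshot = Math.floor(deliveryNum / snapshotInterval) * snapshotInterval;
  return deliveryNum - lastSnapshot;
}

console.log(deliveriesToReplay(5300)); // → 1300 (replay deliveries 4000..5299)
```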
When we do a baggage-style upgrade of a vat, that also truncates the transcript: we don't need to replay the previous version's deliveries.
However, we currently retain all deliveries just in case we ever need to implement the "sleeper agent replay-style upgrade" (#1691). One form of this upgrade could start from the most recent version's starting point, and would replay all deliveries since that upgrade (more than we'd normally store, because heap snapshots don't help us). But we might find that e.g. some internal value was computed during version-1 but then forgotten, and by the time we're on version-3, that data has already been dropped from in-memory state (upon the 1-to-2 upgrade) and was never written into durable state (or was subsequently erased from durable state). If we need that data, the only way to reliably/credibly reacquire it would be a sleeper-agent upgrade that replays all deliveries, starting from the very first `startVat`, simulating the behavior changes of all versions, and remembers the value until the trigger point (whereupon it can write it into durable state). This requires access to every transcript entry since the beginning of the vat's existence.

But, in the meantime, we're keeping an awful lot of data around just in case. The transcripts currently consume the majority of the swingstore space: our mainnet has a stable `kvStore` size, heap snapshots are growing at 1.3 MB per day, but transcripts grow at 57 MB/day. (This is still tiny compared to the block data that cosmos/tendermint retains, even on a node with aggressive pruning, but apparently block data can be shed by a state-sync refresh, whereas swingset data cannot.) And it would be nice to reduce that.

So the idea would be to offload the old transcript data somehow. At the very least we could arrange to compress it, which would probably reduce the size by a factor of 100x (transcripts are very repetitive). We might also be able to remember just a hash, and leave responsibility for the actual (historical) data with an archive node. (If we do this, then a sleeper-agent replay would require that everybody be able to retrieve a copy of this data to proceed; however, they'd be able to get it from untrusted sources and compare the hash against their saved state before using it.)
Description of the Design
I'm thinking that we define a "transcript slab" as the set of transcript entries from one heap snapshot point to the next. So the first entry of the slab would be the first delivery after the heap snapshot was written, and the last entry would be the delivery just before the next snapshot write. The "current slab" is incomplete / mutable / still-being-appended-to, and is stored in the `streamStore` as usual. However, once the heap snapshot is written, it becomes the "previous slab", becomes immutable, and a new incomplete slab is started.

When the heap snapshot is written, we can serialize the previous slab into a single file (using canonical JSON serialization of all entries, like we used to do in `streamStore` before we switched it to SQLite). We then hash that file to get the "transcript slab hash", and compress it. We write the slab hash (along with vatID, starting deliveryNum, ending deliveryNum, maybe the vat version/incarnationNum) into a DB table, and we store the compressed data into a different table (indexed by the slab hash, remaining prepared for two different vats to somehow create identical slab hashes). Then we truncate the `streamStore` entries by removing everything from the old slab (leaving it empty).

That would probably get us 100x savings on transcript data (on mainnet this might reduce our growth to 0.57 MB/day). If we wanted to go further and evict the old slabs entirely, then we'd store the hashes in a table forever, but the actual data would be deleted. We'd want a simple flag (used by archive nodes) that would instead write the compressed slab into a directory structure (indexed/named by vatID and starting deliveryNum). And we might want to additionally hash the compressed form (and demand that all nodes compress the same way: raw `zlib`, or maybe a canonical `gzip` without time/date headers), so that retrieving nodes can validate the compressed files they receive from untrusted archive nodes without needing to first decompress the contents.

I think this should be owned by swing-store: we should tell it when a heap snapshot has been written, and it should know that a transcript slab has been closed off. Then the swing-store mode can determine how to record and retain/delete the data. We should also build the API for retrieving the old contents (even though we won't be using it now), to make sure it would actually work.
Security Considerations
Deleting the transcript data would introduce an availability dependency: if we ever do perform a replay-style upgrade, every validator must first obtain a copy of the data from an archive node before they can proceed. Since we haven't yet implemented this kind of upgrade (and we probably won't, until we discover a serious need for it), we're only guessing about how it would work, and it seems likely that the "download the old transcript slabs" step would be left up to the validator operators, or implemented to pull from some centralized archive node (maybe IPFS?). All validators would validate the slab hashes against their local swingstore before using that data, but if they can't get a correct copy of the data, they can't perform the replay upgrade, which means they'll drop out of consensus.
This "availability hazard" exists in other forms: fetching each block is an availability question. But most of those are designed to be available from many parties (p2p gossip) in the moment, whereas these archived slabs would (by design) not be stored by most nodes, making the availability more tenuous, especially in an adversarial situation where someone is deliberately preventing a particular validator from reaching an archive, to kick them out of consensus.
Test Plan
Unit tests of swing-store.