
compress/archive old transcript entries in swingstore/streamstore #6702

Closed
warner opened this issue Dec 20, 2022 · 7 comments
Labels
enhancement New feature or request SwingSet package: SwingSet vaults_triage DO NOT USE

Comments

@warner
Member

warner commented Dec 20, 2022

What is the Problem Being Solved?

@FUDCo and I were talking about an idea to reduce disk usage on the validators, by archiving (or completely removing) old transcript entries.

Strictly speaking, we only need vat transcript entries as far back as the most recent heap snapshot. We currently create those snapshots once every 2000 deliveries, so we never need to replay more than those 2000 to bring up a new worker (e.g. after we reboot the node, or if a previously-paged-out vat worker gets paged back in).

When we do a baggage-style upgrade of a vat, that also truncates the transcript: we don't need to replay the previous version's deliveries.

However, we currently retain all deliveries just in case we ever need to implement the "sleeper agent" replay-style upgrade (#1691). One form of this upgrade could start from the most recent version's starting point and replay all deliveries since that upgrade (more than we'd normally store, because heap snapshots don't help us). But we might find that e.g. some internal value was computed during version-1 and then forgotten: by the time we're on version-3, that data was dropped from in-memory state (upon the 1-to-2 upgrade) and was never written into durable state (or was subsequently erased from it). If we need that data, the only way to reliably/credibly reacquire it would be a sleeper-agent upgrade that replays all deliveries, starting from the very first startVat, simulating the behavior changes of all versions, and remembers the value until the trigger point (whereupon it can write it into durable state). This requires access to every transcript entry since the beginning of the vat's existence.

But, in the meantime, we're keeping an awful lot of data around just in case. Transcripts currently consume the majority of the swingstore space: on our mainnet the kvStore size is stable and heap snapshots grow at 1.3 MB per day, but transcripts grow at 57 MB/day. (This is still tiny compared to the block data that cosmos/tendermint retains, even on a node with aggressive pruning, but apparently block data can be shed by a state-sync refresh, whereas swingset data cannot.) It would be nice to reduce that.

So the idea would be to offload the old transcript data somehow. At the very least we could arrange to compress it, which would probably reduce the size by a factor of about 100 (transcripts are very repetitive). We might also be able to remember just a hash, and leave responsibility for the actual (historical) data with an archive node. If we do this, then a sleeper-agent replay would require that everybody be able to retrieve a copy of this data to proceed; however, they'd be able to get it from untrusted sources and compare the hash against their saved state before using it.

Description of the Design

I'm thinking that we define a "transcript slab" as the set of transcript entries from one heap snapshot point to the next. So the first entry of the slab would be the first delivery after the heap snapshot was written, and the last entry would be the delivery just before the next snapshot write. The "current slab" is incomplete / mutable / still-being-appended-to, and is stored in the streamStore as usual. However, once the heap snapshot is written, it becomes the "previous slab", becomes immutable, and a new incomplete slab is started.

When the heap snapshot is written, we can serialize the previous slab into a single file (using canonical JSON serialization of all entries, like we used to do in streamStore before we switched it to SQLite). We then hash that file to get the "transcript slab hash", and compress it. We write the slab hash (along with vatID, starting deliveryNum, ending deliveryNum, maybe the vat version/incarnationNum) into a DB table, and we store the compressed data into a different table (indexed by the slab hash, remaining prepared for two different vats to somehow create identical slab hashes). Then we truncate the streamStore entries by removing everything from the old slab (leaving it empty).
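A minimal sketch of that close-out step, assuming a better-sqlite3 handle and hypothetical table names (`transcriptSlabs` for the metadata, `slabData` for the compressed bytes, `streamItems` for the active entries); this is not the real swing-store schema:

```js
// Hypothetical sketch: archive one completed transcript slab.
import { createHash } from 'crypto';
import { deflateRawSync } from 'zlib';

function archiveSlab(db, vatID, incarnationNum, entries) {
  // entries: ordered [{ deliveryNum, entry }] spanning one snapshot interval
  const serialized = entries
    .map(({ entry }) => JSON.stringify(entry))
    .join('\n');
  const slabHash = createHash('sha256').update(serialized).digest('hex');
  const compressed = deflateRawSync(serialized); // raw deflate: no gzip header
  const start = entries[0].deliveryNum;
  const end = entries[entries.length - 1].deliveryNum;
  db.prepare(
    `INSERT INTO transcriptSlabs
       (vatID, incarnationNum, startDeliveryNum, endDeliveryNum, slabHash)
     VALUES (?, ?, ?, ?, ?)`,
  ).run(vatID, incarnationNum, start, end, slabHash);
  // keyed by slab hash; OR IGNORE tolerates two vats producing the same slab
  db.prepare(
    `INSERT OR IGNORE INTO slabData (slabHash, compressed) VALUES (?, ?)`,
  ).run(slabHash, compressed);
  // truncate the archived entries from the active stream
  db.prepare(
    `DELETE FROM streamItems
       WHERE vatID = ? AND deliveryNum BETWEEN ? AND ?`,
  ).run(vatID, start, end);
}
```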

That would probably get us 100x savings on transcript data (on mainnet this might reduce our growth to 0.57 MB/day). If we wanted to go further and evict the old slabs entirely, then we'd store the hashes in a table forever, but the actual data would be deleted. We'd want a simple flag (used by archive nodes) that would instead write the compressed slab into a directory structure (indexed/named by vatID and starting deliveryNum). And we might want to additionally hash the compressed form (and demand that all nodes compress the same way: raw zlib, or maybe a canonical gzip without time/date headers), so that retrieving nodes can validate the compressed files they receive from untrusted archive nodes without needing to first decompress the contents.
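In archive mode, the write-out might look roughly like the following sketch; the directory layout and function name are illustrative, and raw deflate is used here as one deterministic-compression choice:

```js
// Hypothetical archive-mode export of a compressed slab.
import { createHash } from 'crypto';
import { mkdirSync, writeFileSync } from 'fs';
import { join } from 'path';

function exportSlab(archiveDir, vatID, startDeliveryNum, compressed) {
  const dir = join(archiveDir, vatID);
  mkdirSync(dir, { recursive: true });
  writeFileSync(join(dir, `${startDeliveryNum}.slab.deflate`), compressed);
  // hash of the *compressed* bytes, so retrievers can validate a download
  // from an untrusted archive without decompressing it first
  return createHash('sha256').update(compressed).digest('hex');
}
```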

I think this should be owned by swing-store: we should tell it when a heap snapshot has been written, and it should know that a transcript slab has been closed off. Then the swing-store mode can determine how to record and retain/delete the data. We should also build the API for retrieving the old contents (even though we won't be using it now), to make sure it would actually work; a sketch of that side follows.
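A sketch of retrieval-side validation, assuming (as above) that the slab hash covers the uncompressed canonical JSON; the function name is hypothetical:

```js
// Hypothetical validation of a slab fetched from an untrusted archive.
import { createHash } from 'crypto';
import { inflateRawSync } from 'zlib';

function loadArchivedSlab(expectedSlabHash, compressedBytes) {
  const serialized = inflateRawSync(compressedBytes).toString('utf8');
  const actual = createHash('sha256').update(serialized).digest('hex');
  if (actual !== expectedSlabHash) {
    throw new Error(`slab hash mismatch: want ${expectedSlabHash}, got ${actual}`);
  }
  // one JSON-serialized transcript entry per line
  return serialized.split('\n').map(line => JSON.parse(line));
}
```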

Security Considerations

Deleting the transcript data would introduce an availability dependency: if we ever do perform a replay-style upgrade, every validator must first obtain a copy of the data from an archive node before it can proceed. Since we haven't yet implemented this kind of upgrade (and we probably won't, until we discover a serious need for it), we're only guessing about how it would work, and it seems likely that the "download the old transcript slabs" step would be left up to the validator operators, or implemented to pull from some centralized archive node (maybe IPFS?). All validators would validate the slab hashes against their local swingstore before using that data, but a validator that can't get a correct copy of the data can't perform the replay upgrade, which means it will drop out of consensus.

This "availability hazard" exists in other forms: fetching each block is an availability question. But most of those are designed to be available from many parties (p2p gossip) in that moment, and these archived slabs would (by design) not be stored by most nodes, making the availability more tenous, especially in an adversarial situation where someone is deliberately preventing a particular validator from reaching an archive, to kick them out of consensus.

Test Plan

Unit tests of swing-store.

@warner warner added enhancement New feature or request SwingSet package: SwingSet labels Dec 20, 2022
@ivanlei ivanlei added the vaults_triage DO NOT USE label Jan 3, 2023
@mhofman
Member

mhofman commented Jan 10, 2023

I suppose there is overlap here with the rolling hash for state-sync "transcript segments" and the plan to keep those hashes around, even across upgrades (described in #5542 (comment)). However, I don't believe state-sync itself could help fetch evicted transcript slabs/segments; we'd need an external mechanism, relying on the availability of these artifacts through archives.

On the topic of compression, I recently wondered about that myself after noticing how well transcripts compress. A lot of that is due to the shared structure between entries, and normalizing transcript data in SQLite could go a long way toward reducing disk usage: for example, a separate table for syscalls (indexed on vatId + upgradeNum + deliveryNum), an enum for syscall kind, etc.
Then there is the payload of those syscalls and deliveries, which is highly compressible. A combination of approaches could be useful here, from reducing the payload size itself (e.g. through pattern-based compression) to plain old dictionary-based compression.
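For illustration, one possible normalized layout along those lines; the schema and names are hypothetical, not a proposal for the actual swing-store tables:

```js
// Hypothetical normalized schema for syscall data, via better-sqlite3.
import Database from 'better-sqlite3';

const db = new Database('swingstore.sqlite');
db.exec(`
  CREATE TABLE IF NOT EXISTS syscallKinds (
    kindID INTEGER PRIMARY KEY,
    name TEXT UNIQUE              -- 'send', 'resolve', 'vatstoreGet', ...
  );
  CREATE TABLE IF NOT EXISTS syscalls (
    vatID TEXT,
    upgradeNum INTEGER,
    deliveryNum INTEGER,
    syscallNum INTEGER,           -- position within the delivery
    kindID INTEGER REFERENCES syscallKinds(kindID),
    payload BLOB,                 -- still separately compressible
    PRIMARY KEY (vatID, upgradeNum, deliveryNum, syscallNum)
  );
`);
```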

@ivanlei
Contributor

ivanlei commented Jan 23, 2023

Per #6594 (comment) this is no longer the plan of record.

@ivanlei ivanlei closed this as not planned Jan 23, 2023
@FUDCo
Contributor

FUDCo commented Jan 23, 2023

I'm fairly sure closing this is a mistake. Certainly it can be deprioritized but we'll definitely need a way to do the stuff that's described here.

@mhofman
Member

mhofman commented Jan 23, 2023

Yeah, I think I mentioned that the ability to archive would come almost for free after the state-sync implementation, but we'd still need something to actually do the archiving.

I think we don't need to look into compressing, for now at least.

@warner warner reopened this Jan 30, 2023
@warner
Member Author

warner commented Jan 30, 2023

Yeah, I agree that this shouldn't have been closed. We may or may not need to include the transcript hashes in the activity hash (which is what #6594 was about), but we still need to make it convenient to delete old transcript entries (and/or compress them) to keep our on-disk storage from growing without bound.

I'm still of the opinion that we should move transcript entries out of the "active" table once a heap snapshot has happened, and into an "inactive" table (where the rows are compressed spans, plus metadata and a hash), because that makes enumerating/extracting the artifacts for state-sync pretty easy (see the sketch below). But I'm willing to be talked into a different approach.
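To illustrate why that helps state-sync: with compressed spans in an "inactive" table, enumerating the artifacts is a single query. All names here are hypothetical:

```js
// Hypothetical state-sync artifact enumeration over an "inactive" table.
function* exportTranscriptArtifacts(db) {
  const rows = db
    .prepare(
      `SELECT vatID, startDeliveryNum, endDeliveryNum, slabHash, compressed
         FROM inactiveTranscripts
        ORDER BY vatID, startDeliveryNum`,
    )
    .iterate();
  for (const row of rows) {
    yield {
      name: `transcript.${row.vatID}.${row.startDeliveryNum}.${row.endDeliveryNum}`,
      hash: row.slabHash, // hash of the uncompressed span
      data: row.compressed, // compressed span bytes
    };
  }
}
```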

@warner
Member Author

warner commented Jan 30, 2023

I'm assigning this to @FUDCo and @mhofman to figure out as part of the state-sync / swingstore work, but of course compression isn't nearly as important as state-sync, and I'm ok if compression doesn't happen in time for the vaults/bulldozer release. If it happens as a side-effect, awesome, and please close this when the feature is in place. If it doesn't, please leave it open with some notes here about how we should approach it at some time in the future when the priority rises to the top of the queue.

@warner
Member Author

warner commented Sep 9, 2023

Replaced by #8318, which has a more detailed plan and a new urgency (the mainnet swingstore is up to about 21 GB).

@warner warner closed this as completed Sep 9, 2023