move snapstore (XS heap snapshots) into SQLite #6742
Comments
Quick observation: the temporary buffering of the compressed snapshot would also need to happen when making a snapshot, as there is the same streaming-into-DB limitation, as well as the inability to know the hash of the snapshot for the primary ID. The latter could be solved by using a primary ID generated randomly or incrementally, but that would likely require making sure that this primary ID is internal and not used in any consensus paths. However, all that is unnecessary if there is no way to stream blobs from the DB. Btw we could imagine a chunking mechanism to avoid holding full compressed snapshots in memory, but that's effectively re-implementing streaming. I am also unconvinced that we need to store identical snapshots in the same table entry. This feels like an unnecessary optimization, where the potential space savings are not worth the complexity costs.
Good points. I'm not worried about the RAM on the snapshot-write side (at least I'm equally non-worried about the write- and read- sides). So I think we read the stream from xsnap, feed each chunk into both the hasher and the compressor, accumulate the compressed data in RAM, then when the stream is done, we write the large compressed blob into the DB under its hash name. I agree that de-duplicating snapshots is not an important use case (and the practical chances of convergence are pretty low, especially if we update our "when do we take the first snapshot" code to make sure it includes all the deliveries we do during contract startup, which will probably make them completely diverge). I'm a big fan of hash-named files, but if we're saving them as blobs, then we might as well just use
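The write path described in this comment could be sketched roughly as follows. This is a minimal illustration in Python rather than the project's own code: the `snapshots` table, its column names, and the `save_snapshot` helper are hypothetical, and an in-memory list of byte chunks stands in for the stream coming out of xsnap.

```python
import hashlib
import sqlite3
import zlib


def save_snapshot(db, chunks):
    """Hash the uncompressed stream, compress it in RAM, store one blob."""
    hasher = hashlib.sha256()
    compressor = zlib.compressobj(wbits=31)  # wbits=31 selects gzip framing
    compressed = bytearray()
    for chunk in chunks:
        hasher.update(chunk)  # hash covers the uncompressed bytes
        compressed += compressor.compress(chunk)
    compressed += compressor.flush()
    snapshot_id = hasher.hexdigest()
    with db:  # one transaction: the blob is written only once the hash is known
        db.execute(
            "INSERT OR IGNORE INTO snapshots (hash, compressed) VALUES (?, ?)",
            (snapshot_id, bytes(compressed)),
        )
    return snapshot_id


# Hypothetical schema and a fake stream, for illustration only.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE snapshots (hash TEXT PRIMARY KEY, compressed BLOB NOT NULL)")
fake_stream = [b"heap-part-1", b"heap-part-2"]
snapshot_id = save_snapshot(db, fake_stream)
```

Only the compressed bytes accumulate in memory here; the uncompressed chunks are dropped as soon as they have been hashed and compressed.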
Was thinking we could simply select all rows for a particular vatID, sort by startPos, and only use the last one when loading from snapshot. That way removing old rows is simply a matter of pruning, which can be host defined. Also we probably should still store the computed hash of the uncompressed data along with the blob of the compressed data, since we will need it for consensus, state sync and debuggability. For state sync however we did talk a few weeks ago about being able to mark a snapshot as "in use" by the host application while the state sync artifacts are being generated. The goal is to not do expensive operations when initiating a state sync snapshot, and instead leave that to the asynchronous processing, which can span blocks if necessary. If we don't do reference counting on snapshot IDs and go with vatID+startPos instead, we may need the state sync logic to constrain XS snapshot pruning. Or we could just go the route of creating a read transaction on this table for state sync purposes.
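A rough sketch of the vatID+startPos layout suggested in this comment, written in Python for illustration. The table name, column names, and both helper functions are assumptions, not an agreed schema; the pruning policy shown (keep only the newest row per vat) is just one possible host-defined choice.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    """
    CREATE TABLE snapshots (
        vatID TEXT NOT NULL,
        startPos INTEGER NOT NULL,
        hash TEXT NOT NULL,          -- SHA256 of the uncompressed snapshot
        compressed BLOB NOT NULL,
        PRIMARY KEY (vatID, startPos)
    )
    """
)


def latest_snapshot(db, vat_id):
    """Return (startPos, hash, compressed) for the newest row, or None."""
    return db.execute(
        "SELECT startPos, hash, compressed FROM snapshots"
        " WHERE vatID = ? ORDER BY startPos DESC LIMIT 1",
        (vat_id,),
    ).fetchone()


def prune_old_snapshots(db, vat_id):
    """One possible pruning policy: keep only the newest row for this vat."""
    with db:
        db.execute(
            "DELETE FROM snapshots WHERE vatID = ? AND startPos <"
            " (SELECT MAX(startPos) FROM snapshots WHERE vatID = ?)",
            (vat_id, vat_id),
        )


# Made-up rows for illustration.
rows = [("v1", 10, "aaa", b"old"), ("v1", 200, "bbb", b"new"), ("v2", 5, "ccc", b"other")]
with db:
    db.executemany("INSERT INTO snapshots VALUES (?, ?, ?, ?)", rows)
newest = latest_snapshot(db, "v1")
prune_old_snapshots(db, "v1")
```

An archive node could skip the pruning call entirely, and a state sync export could hold a read transaction open while pruning proceeds on the writer side.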
…tore This is phase 1 of #6742. These changes cease storing snapshots in files but instead keep them in a new table in the swingstore SQLite database. However, in this commit, snapshot tracking metadata is still managed the old way using entries in the kvstore, rather than being integrated directly into the snapshots table.
What is the Problem Being Solved?
The next step of #3087 is to move `snapStore` into SQLite too: this is the component of `swing-store` that holds XS heap snapshots. These heap snapshots are files, 2-20MB when compressed, created by `xsnap` when it is instructed to write out the state of its heap. The `xsnap` process can be launched from a snapshot instead of an empty heap, which saves a lot of time (no need to replay the entire history of the vat).

Currently, `swing-store` holds these in a dedicated directory (one file per snapshot), in which each file is named after the SHA256 hash of its uncompressed contents (`.agoric/data/ag-cosmos-chain-state/xs-snapshots/${HASH}.gz`). The `kvStore` holds a JSON blob with `{ snapshotID, startPos }` in the `local.v$NN.lastSnapshot` key, to keep track of the vatID->snapshot mapping. It also holds `local.snapshot.$id = JSON(vatIDs..)` to track the snapshot->vatIDs direction (remember that two vats might converge and use the same snapshot, e.g. newly-created ZCF vats running the same contract that have not diverged significantly yet).

The one-file-per-snapshot approach effectively creates a distinct database, whose commit semantics are based upon an atomic rename (creating the `HASH.gz` file) and some eventual `unlink()` syscall that deletes the file. These commit points are different from those of the `kvstore` which references the files, requiring some annoying interlocks to make sure 1: we always add the file before adding the `kvstore` reference, and 2: we never delete the file before committing the removal of the last `kvstore` reference.

It would be a lot cleaner to record both the vat-to-snapshot mapping and the snapshots themselves in the same atomicity domain. Basically two tables:
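As a minimal sketch of what that could look like, assuming the table names `heapSnapshots` and `vatHeaps` used in this issue: the column names and the `set_vat_heap` helper below are hypothetical, and Python's `sqlite3` is used only for illustration. The point is that the blob insert, the mapping update, and the cleanup of unreferenced heaps all land in one transaction.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")
db.executescript(
    """
    CREATE TABLE heapSnapshots (
        hash TEXT PRIMARY KEY,
        compressed BLOB NOT NULL
    );
    CREATE TABLE vatHeaps (
        vatID TEXT PRIMARY KEY,
        hash TEXT NOT NULL REFERENCES heapSnapshots (hash)
    );
    """
)


def set_vat_heap(db, vat_id, snapshot_hash, compressed):
    """Store the blob, point the vat at it, and drop unreferenced heaps."""
    with db:  # all three statements commit (or roll back) together
        db.execute(
            "INSERT OR IGNORE INTO heapSnapshots (hash, compressed) VALUES (?, ?)",
            (snapshot_hash, compressed),
        )
        db.execute(
            "INSERT OR REPLACE INTO vatHeaps (vatID, hash) VALUES (?, ?)",
            (vat_id, snapshot_hash),
        )
        db.execute(
            "DELETE FROM heapSnapshots WHERE hash NOT IN (SELECT hash FROM vatHeaps)"
        )


# Two vats converge on heap h1, then move to h2 one after the other.
set_vat_heap(db, "v1", "h1", b"blob-1")
set_vat_heap(db, "v2", "h1", b"blob-1")
set_vat_heap(db, "v1", "h2", b"blob-2")  # h1 survives: v2 still references it
set_vat_heap(db, "v2", "h2", b"blob-2")  # h1 is now unreferenced and removed
remaining = [row[0] for row in db.execute("SELECT hash FROM heapSnapshots")]
```

With the foreign key enforced, SQLite itself rejects deleting a snapshot that a `vatHeaps` row still references, which replaces the file-versus-kvstore interlocks described above.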
During `commit`, or just after changing a `vatHeaps` entry, we can scan `heapSnapshots` for unreferenced heaps and delete them.

This will interact with @mhofman's work to make `xsnap` read/write its heap by streaming it over a pipe, rather than writing it to a file. This also removes the need for `xsnap` to have access to the filesystem, which will help with #2386 jail.

Description of the Design
In addition to the new tables, I think the `swingStore.snapStore` component will have some different APIs. I think one pair to read/write snapshots (doing some streaming thing, maybe an AsyncIterator of uncompressed chunks), and a separate pair to either assign a vatID->snapshotID mapping, or clear the mapping (e.g. when upgrading or terminating a vat).

The kvstore keys (`local.v$NN.lastSnapshot` and `local.snapshot.$id`) will go away, in favor of the proper cross-table foreign keys. The `startPos` field from `lastSnapshot` needs to be tracked next to the snapshotID: the possibility of convergence means that two different vats might conceivably arrive at the same heap snapshot but on different deliveryNums. This should probably coordinate with the `streamStore`, so they're all using a matched `deliveryNum` or transcript entry index.

Efficiency Considerations
We've had some concerns about putting large blobs in SQLite. I (warner) am pretty sure this will be fine. I found one article (https://www.sqlite.org/intern-v-extern-blob.html) examining read speed differences between external files and BLOBs, and for the (default) 4kiB pages we use, they report that external files can be read about twice as fast as in-DB blobs. I ran the tests on a follower node (SSD filesystem), and found the same difference. But note that we're talking about 544MBps for in-DB blobs, vs 965MBps for files on disk, so a typical 2MB compressed snapshot is going to load in a millisecond or two, and the extra speed isn't going to matter.
Using blobs from DB will require slightly more memory, because the SQLite API doesn't provide streaming access to the blob contents (it is delivered as a single large span of memory), whereas pulling files from disk could read just enough bytes to decompress the next chunk. So while we start a worker from heap snapshot, the kernel process will briefly require 2-20MB of RAM to hold the compressed snapshot data. This will be freed once decompression is complete. Note that we don't need to hold a copy of the decompressed data: we can stream that out as fast as the xsnap process can accept it, and never need to hold more than a reasonably-sized buffer.
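The read path described here could look roughly like the sketch below: the compressed blob is fetched whole, then decompressed in bounded pieces that can be handed to the worker as they are produced. This is an illustration in Python; the `heapSnapshots` schema, the `stream_snapshot` helper, and the 64 KiB chunk size are assumptions.

```python
import sqlite3
import zlib


def stream_snapshot(db, snapshot_hash, chunk_size=64 * 1024):
    """Yield uncompressed chunks, each capped at chunk_size except possibly the final flush."""
    row = db.execute(
        "SELECT compressed FROM heapSnapshots WHERE hash = ?", (snapshot_hash,)
    ).fetchone()
    if row is None:
        raise KeyError(snapshot_hash)
    pending = row[0]  # the whole compressed blob is held in RAM here
    decompressor = zlib.decompressobj(wbits=31)  # wbits=31 expects gzip framing
    while pending:
        chunk = decompressor.decompress(pending, chunk_size)
        if chunk:
            yield chunk
        pending = decompressor.unconsumed_tail  # input not yet consumed
    tail = decompressor.flush()
    if tail:
        yield tail


# Hypothetical setup: a 1 MiB stand-in for a heap snapshot.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE heapSnapshots (hash TEXT PRIMARY KEY, compressed BLOB NOT NULL)")
heap = bytes(range(256)) * 4096
gz = zlib.compressobj(wbits=31)
db.execute(
    "INSERT INTO heapSnapshots VALUES (?, ?)",
    ("demo", gz.compress(heap) + gz.flush()),
)
restored = b"".join(stream_snapshot(db, "demo"))
```

The `join` at the end is only there to check the round trip; a real consumer would write each chunk to the worker's pipe and discard it.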
Debugging Considerations
We might want a switch to disable the "delete unused snapshots" code, for archive nodes (@mhofman has found it awfully useful to have a way to retain all heap snapshots, for later forensics). To help correlate these with vats, maybe we should have a table of historical `(vatID, lastPos) -> snapshotID` mappings. Each time we update the main table, we also add an entry to this debugging table (but the debug table entries won't keep the snapshots alive, so either they aren't FOREIGN KEYs or the table only exists when we also remove the "delete unused snapshots" code, so the constraint is never violated).

Security Considerations
Shouldn't be any.
Test Plan
Unit tests.