more efficient/transactional state storage #54
Background
The Batch starts when an inbound message (or ack) is delivered to the Mailbox device, and it enqueues a message for the comms vat to process. It can also start when a timer event is delivered to the Timer device, which behaves the same way. The host loop is responsible for calling
In a blockchain environment, there will probably be one Batch per block, performed at the end of the block after all
Requirements
For SwingSet correctness, we must only commit the kernel state at the end of a crank. For blockchain correctness, we must only commit the kernel state at the same points that the underlying blockchain commits its state (at a block boundary). For performance, we might only commit kernel state at the end of a batch. There is a tradeoff between minimizing the number of database commit cycles (which are typically pretty expensive) and minimizing the progress that might be lost due to a crash.
There are a number of problems that could prevent the turn from completing, and we must be careful not to commit (partial) state from a failed turn:
The kernel does not know that a batch has finished (it only sees the individual cranks), so it must be told explicitly when to save its state. The host does this by calling
My Plan:
(which is subject to change, because halfway through writing this we had an epic meeting that resulted in Agoric/SwingSet#149 and now I don't know what to build anymore)
HostDB
The host is responsible for providing a "HostDB" object (endowment) to the kernel, with an API that takes string keys/values:
The
We currently expect the host to use LevelDB or something similar to implement this API. We avoid requiring transactional semantics from the host's choice of database. If the kernel is not given a
KernelDB
The kernel will implement a "KernelDB" layer which provides (in order of read priority):
The KernelDB exposes a key-value API which is somewhat similar to that of the HostDB, with
Operations that modify state (like adding something to a C-List) do a
KernelDB also provides several control functions to the kernel itself (the "kernel" object, as constructed by
The kernel does not know that a batch has finished (it only sees the individual cranks), so it must be told explicitly when to save its state. The host does this by calling
Later, when the host does
The KernelDB API is thus:
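To illustrate the layering described above (this is not the actual KernelDB API), here is a minimal sketch of a crank buffer sitting in front of a synchronous, non-transactional HostDB. Reads consult the buffer first and fall back to the host; writes stay in the buffer until the crank is committed. Every name in it (`makeHostDB`, `makeKernelDB`, `commitCrank`, `abortCrank`, and the example keys) is an assumption made for illustration.

```javascript
// Hypothetical sketch: names and key formats are illustrative assumptions,
// not the SwingSet API.

// Stand-in for a host-provided string/string store (e.g. backed by LevelDB).
function makeHostDB() {
  const data = new Map();
  return {
    has: key => data.has(key),
    get: key => data.get(key),
    set: (key, value) => { data.set(key, value); },
    delete: key => { data.delete(key); },
  };
}

function makeKernelDB(hostDB) {
  const DELETED = Symbol('deleted'); // tombstone for buffered deletions
  let crankBuffer = new Map();

  return {
    // Reads: crank buffer first, then the host.
    has(key) {
      if (crankBuffer.has(key)) return crankBuffer.get(key) !== DELETED;
      return hostDB.has(key);
    },
    get(key) {
      if (crankBuffer.has(key)) {
        const value = crankBuffer.get(key);
        return value === DELETED ? undefined : value;
      }
      return hostDB.get(key);
    },
    // Writes: buffered only, never sent to the host mid-crank.
    set(key, value) { crankBuffer.set(key, value); },
    delete(key) { crankBuffer.set(key, DELETED); },
    // End of a successful crank: push buffered changes down to the host.
    commitCrank() {
      for (const [key, value] of crankBuffer) {
        if (value === DELETED) hostDB.delete(key);
        else hostDB.set(key, value);
      }
      crankBuffer = new Map();
    },
    // Failed crank: drop everything that was buffered.
    abortCrank() { crankBuffer = new Map(); },
  };
}

const hostDB = makeHostDB();
const kernelDB = makeKernelDB(hostDB);
kernelDB.set('v1.c.o+1', 'ko20');
kernelDB.abortCrank();   // failed crank: nothing reaches the host
kernelDB.set('v1.c.o+1', 'ko21');
kernelDB.commitCrank();  // successful crank: the host now sees 'ko21'
```

The host would still decide separately when its own store commits a block's worth of changes; this sketch only covers the per-crank layer.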
.. as the index for kernel state data structures, instead of user-provided vatName/deviceName strings. This will make it easier to use key/value state storage (#144), since the keys will be more uniform (and shorter, and not vulnerable to people using the separator string in their vat name). Also it will make adding dynamic Vats easier (#19), as we'll just increment a counter instead of asking the user to discover a unique name. closes #146
.. as the index for kernel state data structures, instead of user-provided vatName/deviceName strings. This will make it easier to use key/value state storage (#144), since the keys will be more uniform (and shorter, and not vulnerable to people using the separator string in their vat name). They look like "vNN" and "dNN", where NN is a positive integer. This will also make adding dynamic Vats easier (#19), as we'll just increment a counter instead of asking the user to invent a unique name. closes #146
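A minimal sketch of counter-based ID allocation as described in the commit message above. The helper name `makeIDAllocator` and the name-to-ID maps are assumptions for illustration; only the "vNN"/"dNN" format comes from the commit message.

```javascript
// Hypothetical sketch: makeIDAllocator is an illustrative name, not SwingSet's.
function makeIDAllocator() {
  let nextVat = 1;    // NN is a positive integer, so counting starts at 1
  let nextDevice = 1;
  const vatIDs = new Map();    // user-provided vat name -> "vNN"
  const deviceIDs = new Map(); // user-provided device name -> "dNN"

  function allocate(table, prefix, name, takeNext) {
    if (!table.has(name)) {
      table.set(name, `${prefix}${takeNext()}`);
    }
    return table.get(name);
  }

  return {
    allocateVatID: name =>
      allocate(vatIDs, 'v', name, () => nextVat++),
    allocateDeviceID: name =>
      allocate(deviceIDs, 'd', name, () => nextDevice++),
  };
}

const ids = makeIDAllocator();
const commsID = ids.allocateVatID('comms');        // 'v1'
const mailboxID = ids.allocateDeviceID('mailbox'); // 'd1'
// State keys can then be built from the ID, so a separator inside a
// user-chosen name can never collide with the key syntax.
const exampleKey = `${commsID}.transcript.0`;      // key layout is made up
```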
This is the first phase of #144, to narrow the API we need from a backing store. All state mutations go through the keepers, and the keepers deal exclusively with an object that uses string keys and string values. We still JSON-serialize this object on demand (at the end of the batch/block). The next phase will replace the buildVatController/buildKernel APIs to accept a Storage object, with get/set/getRange/deleteRange methods, and have the JSON-serialized backing store object live outside the kernel (in the host). A subsequent phase will introduce commitCrank/commitBlock methods on that Storage object, then a final phase will replace the JSON backing store with a proper database.
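As a rough sketch of what a Storage object with get/set/getRange/deleteRange methods could look like while the backing store is still JSON-serialized: the method names come from the commit message above, but the range semantics (half-open `[start, end)`, lexicographic order), the `has` and `serialize` helpers, and the name `makeJSONStorage` are assumptions.

```javascript
// Hypothetical sketch of a string/string Storage object over a
// JSON-serializable backing store. Range semantics are assumed.
function makeJSONStorage(initialJSON = '{}') {
  // A Map avoids surprises with keys such as "__proto__".
  const state = new Map(Object.entries(JSON.parse(initialJSON)));

  function keysInRange(start, end) {
    return [...state.keys()].filter(k => start <= k && k < end).sort();
  }

  return {
    has: key => state.has(key),
    get: key => state.get(key),
    set(key, value) {
      if (typeof key !== 'string' || typeof value !== 'string') {
        throw new TypeError('keys and values must be strings');
      }
      state.set(key, value);
    },
    // Returns [key, value] pairs with start <= key < end, sorted by key.
    getRange: (start, end) =>
      keysInRange(start, end).map(k => [k, state.get(k)]),
    deleteRange(start, end) {
      for (const k of keysInRange(start, end)) state.delete(k);
    },
    // Produces the string the host would persist at the end of a block.
    serialize: () => JSON.stringify(Object.fromEntries(state)),
  };
}

const storage = makeJSONStorage('{"v1.t.0":"a","v1.t.1":"b","v2.t.0":"c"}');
storage.set('v1.t.2', 'd');
const v1Entries = storage.getRange('v1.', 'v1/'); // '/' sorts right after '.'
storage.deleteRange('v2.', 'v2/');
const saved = storage.serialize();
```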
This rewrites the kernel state management, in support of #144. Previously, kernel/vat state was stored in a composite object, with everything being JSON-serialized into a single big string at the end of the batch/block. In addition, the kernel promise-management code was a bit casual about mutating that state from outside the keepers. In phase 1, we still use JSON-serialization, but the object is now reduced to a strict string/string key-value store, and the storage/keeper layer returns hardened objects (so all mutations must go through the storage API). In phase 2 (also in this PR), the internal API is refactored to reflect the upcoming DB interface. The JSON-serialized KV object is wrapped in a has/get/set -style API. This PR retains compatibility with the existing host API (i.e. `buildVatController()` still accepts an `initialState=` string, and still has a `.getState()` method that returns a string). The next phases will replace this with a host-provided DB object.
This removes c.getState(). Instead, the host should retain control over the hostDB object it provides to the controller, so the host can choose when the hostDB should commit a block's worth of changes. The kernel's Keepers use a read-cache to minimize cross-Realm calls and hostDB operations. This read-cache does not yet evict any entries, so a future task is to build some basic LRU-like policy for it. But the largest performance problem right now, the vat transcript, is specifically *not* kept in the cache, so memory usage should be reduced somewhat even without proper eviction. refs #144 Extra thanks to @gamedevsam in #164 for a recommendation in kernelKeeper, to replace for-loop starting points with named constants.
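The commit above leaves cache eviction as a future task. One possible shape for a basic LRU-like policy is sketched here as a write-through read cache; it is not the kernel's implementation, and `makeLRUReadCache`, `maxEntries`, and `stats` are hypothetical names. It relies on a JavaScript `Map` iterating in insertion order.

```javascript
// Hypothetical sketch of an LRU read cache in front of a synchronous hostDB.
// hostDB only needs get(key) and set(key, value); a plain Map works as a demo.
function makeLRUReadCache(hostDB, maxEntries) {
  const cache = new Map(); // oldest entry first, most recently used last
  let hostReads = 0;

  function remember(key, value) {
    cache.delete(key); // re-inserting moves the key to the "newest" end
    cache.set(key, value);
    if (cache.size > maxEntries) {
      const oldestKey = cache.keys().next().value;
      cache.delete(oldestKey);
    }
  }

  return {
    get(key) {
      if (cache.has(key)) {
        const value = cache.get(key);
        remember(key, value); // refresh recency on a hit
        return value;
      }
      hostReads += 1; // cache miss: one real hostDB operation
      const value = hostDB.get(key);
      remember(key, value);
      return value;
    },
    set(key, value) {
      hostDB.set(key, value); // write-through, so the host stays current
      remember(key, value);
    },
    stats: () => ({ hostReads, cached: cache.size }),
  };
}

const backing = new Map([['a', '1'], ['b', '2'], ['c', '3']]);
const readCache = makeLRUReadCache(backing, 2);
readCache.get('a'); // miss
readCache.get('b'); // miss
readCache.get('a'); // hit, 'a' becomes most recent
readCache.get('c'); // miss, evicts 'b' (least recently used)
readCache.get('a'); // still a hit
```

Data that should never be cached, such as the vat transcript, would need to bypass a wrapper like this entirely.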
https://github.com/Level/awesome is a pretty amazing list of NPM modules in the LevelDB-ish ecosystem. I'm still trying to figure out how to deal with the asynchronous nature of the abstract-leveldown API, but if we can overcome that, the
Using Level-ecosystem modules in the kernel (e.g. for the CrankBuffer or the Read Cache) would require these modules to work under SES, which might be easy or might be difficult, depending upon how enthusiastic they are for subdependencies. The whole
The SwingSet kernel has a narrow interface with the host Realm, to prevent leaking objects across the Realm boundary (which could enable a sandbox breach). While we could build a Membrane that wraps Promises, it would be a hassle.
The more difficult aspect is that the kernel really wants to be able to handle syscalls synchronously. A
Our Vat execution model allows Vat Code (the ocap-layer) to use Promises internally, in particular inbound method invocations can return a Promise and the liveSlots layer does the right thing. But method sends (
Dean and I talked through some ideas, and didn't find anything really satisfactory. For now, we're going to use the "really aggressive read cache" approach. We require that hosts provide synchronous DB access as defined by the current storageAPI, and to do this with LevelDB, the host must read the entire state vector into memory at startup, so it can satisfy synchronous reads as the kernel operates. Writes cause the in-memory cache to be updated, as well as getting flushed out to disk for next time.
This isn't great, but it gets me unblocked, and still reduces our memory footprint somewhat: we don't need to serialize a full copy of the state for each write. The memory footprint should be equal to one full copy of the state vector, plus a small epsilon during each write.
The next-better approach is to recognize that the transcripts could be fetched asynchronously, since we only need them during startup, and can take as much time as we want. We append to the transcript during operation, but never need to read from them. And transcripts are the most annoying part of the state vector, since they grow forever. To benefit from this distinction, we might want a
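A minimal sketch of the "really aggressive read cache" idea, assuming the host has already finished its (asynchronous) full read of the database before the kernel starts. `makeSyncStorage`, `flushBatch`, and the operation format are hypothetical; a real host would hand the queued operations to its LevelDB-style backend.

```javascript
// Hypothetical sketch: the whole state vector lives in memory so the kernel
// gets synchronous reads; writes update memory and are queued for the disk.
function makeSyncStorage(initialEntries, flushBatch) {
  const memory = new Map(initialEntries); // one full copy of the state
  let pending = [];                       // writes not yet sent to disk

  return {
    has: key => memory.has(key),
    get: key => memory.get(key),
    set(key, value) {
      memory.set(key, value);
      pending.push({ type: 'put', key, value });
    },
    delete(key) {
      memory.delete(key);
      pending.push({ type: 'del', key });
    },
    pendingCount: () => pending.length,
    // Called by the host when it wants the changes to become durable.
    // flushBatch is an async function supplied by the host.
    flush() {
      const ops = pending;
      pending = [];
      return flushBatch(ops);
    },
  };
}

// Demo with a fake "disk" that just records the batches it receives.
const writtenToDisk = [];
const syncStorage = makeSyncStorage(
  [['kernel.nextVatID', '3']],            // as loaded at startup
  async ops => { writtenToDisk.push(...ops); },
);
syncStorage.set('kernel.nextVatID', '4'); // synchronous from the kernel's view
const current = syncStorage.get('kernel.nextVatID');
const queuedBeforeFlush = syncStorage.pendingCount();
syncStorage.flush();                      // returns a promise the host can await
```

Only the queued operations are written out, so no full serialized copy of the state is produced per write.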
Notes from today's conversation with Chris: We're trying to address a use case where external users (in aggregate) have access to a "large number" (one zillion, in metric units, written
There are smaller tables too: the Comms Vat is talking to lots of remote machines, but not 1Z of them. And there are other use cases that are slightly easier to deal with: 1Z messages to/from a Vat which doesn't create separate objects for each one (e.g. a simple counter: the only state is a single integer, but the transcript is very long). We could solve this case with specialized state management for the transcript, but it wouldn't help the more general "1Z objects" case.
(the half-box on the left is the Comms Vat, the half-box on the right is the Issuer Vat, and the kernel is in the middle)
Our job is to move all of these zillion-sized tables off to secondary storage: on disk, not in RAM.
Synchronous Host/Kernel
I think the "split into two threads and use a semaphore to block one" alternative is actually not that hard. Node.js already has the notion of worker threads (https://nodejs.org/api/worker_threads.html), which takes care of the threading and communication. For synchronization, I'd turn to the power of *nix, and use a filesystem FIFO as the semaphore. No need for FFI. Fundamentally, for communication, our main thread just needs to call receiveMessageOnPort at the right time to poll the message port, and block if the message is not available. I have a sample of how all this works together: https://gist.github.com/michaelfig/6d199ea95eab3ebdb918359612123a3e
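For readers who want the shape of this without opening the gist, here is a small independent sketch (not the code from the gist). It swaps the filesystem FIFO for `Atomics.wait` on a `SharedArrayBuffer` as the blocking primitive, and keeps `receiveMessageOnPort` for picking up the reply. The worker's in-memory `db`, the message format, and the `syncGet` name are assumptions.

```javascript
const {
  Worker, MessageChannel, receiveMessageOnPort,
} = require('worker_threads');

// Worker side: owns the "database" and is free to use async APIs internally.
// Here it is just a Map, as a stand-in for an asynchronous backend.
const workerSource = `
  const { workerData } = require('worker_threads');
  const { port, signal } = workerData;
  const db = new Map([['kernel.nextVatID', '7']]);
  port.on('message', ({ key }) => {
    port.postMessage({ value: db.get(key) });
    Atomics.store(signal, 0, 1);  // mark the reply as ready
    Atomics.notify(signal, 0);    // wake the blocked main thread
  });
`;

const { port1, port2 } = new MessageChannel();
const signal = new Int32Array(new SharedArrayBuffer(4));
const worker = new Worker(workerSource, {
  eval: true,
  workerData: { port: port2, signal },
  transferList: [port2],
});
worker.unref(); // do not keep the process alive just for the worker

// Main side: looks synchronous to the caller (e.g. the kernel).
function syncGet(key) {
  Atomics.store(signal, 0, 0);
  port1.postMessage({ key });
  // Block this thread until the worker flips the flag (5 s safety timeout).
  if (Atomics.wait(signal, 0, 0, 5000) === 'timed-out') {
    throw new Error(`timed out waiting for ${key}`);
  }
  return receiveMessageOnPort(port1).message.value;
}

const nextVatID = syncGet('kernel.nextVatID');
```

Every read pays for a structured-clone round trip between threads, which is the performance cost to keep in mind with this approach.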
Hmm. Ok, so in one sense, the host can do whatever it likes as long as it provides the right (sync) API. So the kernel would be unaware of the complexity of multiple threads. Host authors would be obligated to know about it (although for Node.js-specific environments we could hide that complexity in a library).
The host is also responsible for managing the block-level transactionality: choosing when to update the durable state from which the next incarnation will restart, keeping it aligned with any other stored state (e.g. cosmos-sdk blocks), and withholding externally-visible messages until the state that generated them has been recorded (hangover inconsistency).
I'd be worried about performance: we're basically serializing all reads and writes and shipping them through the kernel to the other thread, deserializing them again, then delivering them to the leveldown storage backend. Maybe not a big deal, but I'd want to keep it in mind. I suspect we could reduce the complexity a little bit by using
Do you think the worker/multiprocess approach is feasible to integrate into the golang/cosmic-swingset world? There are even more threads going on there. An FFI interface might be a simpler component to use.
And we need to figure out a clean way to do this in an XS world. For that, I think we have to do an FFI equivalent anyway, since all host authorities are coming from C code.
On the whole I'm more inclined to do an FFI thing, but maybe this trick would let us put that off for a while.
It's not really any more complex than adding a single additional worker thread. Multiprocess would need the same thing, backed by a socket/IPC of some sort. For XS, I'd suggest multiprocess, with a blocking interprocess message-passing C primitive, rather than implementing a Worker interface for it or having to write everything in C. We can do all kinds of cheating (such as shared memory) if we find this is too slow.
I've only seen pipe(2) available as the
I'm thinking that writing FFI for a simple, process-wide block/unblock mechanism (such as a C++20 semaphore) would be even better than trying to wrangle FIFOs or file descriptors, especially when porting to Windows. Thanks for your comments.
We'll need to decide how to break down this task. The
The "reduce large tables" task is covered by a number of smaller tickets:
So I think we can close this now.
Over in cosmic-swingset, https://github.com/Agoric/cosmic-swingset/issues/65 has been collecting ideas to improve the way we store the kernel state. Most of that work needs to happen here, in SwingSet, with an eye towards providing an API that cosmic-swingset can connect to.