share Zoe/ERTP libraries among contracts #2391
#8416 explores current bundle usage, and determines that we could save 90% if we had this sort of sharing.

Some updates in the 2.5 years since we first established this trajectory:

The reduction in cost is a good thing, but it also removes a deterrent against spam and abuse. I'd prefer that we implement some sort of format-check/filtering to the …
What is the Problem Being Solved?
@dtribble pointed out that we'd really like to share the Zoe/ERTP libraries between different contracts, so that each one doesn't need to bundle its own copy. This feeds into a fairly large (and really cool) feature, in which vats/programs/contracts are defined as a graph of modules, some of which are shared, some of which are unique to the vat/program/contract, and we amortize space/time/auditing/trust by taking advantage of that commonality.
Currently, each contract is defined by a starting file (in JS module syntax) which `export`s a well-known `start` function. Typically this starting file will `import` a bunch of other modules: some written by the contract author, but many coming from Zoe and ERTP (math helper libraries, `Issuer`, etc). The deployment process feeds this starting file to our `bundleSource` function, which is responsible for gathering all the necessary code into a single serializable "bundle" object. This bundle is then transmitted to the chain, where it can be used to create new contract instances. The bundle is stored (as a big data object) in the Zoe vat when the contract is first registered (#46 is basically about storing this somewhere more efficient). On its way to Zoe, the bundle appears as a message argument in several intermediate steps: the vat that executes the deploy script, several comms/vattp vats, and some cosmos/tendermint transaction messages. Once on Zoe, each time the contract is instantiated, Zoe must send the bundle to a newly-created dynamic vat, which creates several more copies of the bundle data.

The starting file lives in a `node_modules/`-style package directory, in which some particular version of Zoe and ERTP has been installed (e.g. `node_modules/@agoric/ERTP/` contains those library files). `bundleSource` follows the usual Node.js rules to consult `package.json` and find the files to satisfy each `import` statement. ERTP depends upon several other Agoric-authored modules, and those files get included too. We should build some tools to measure this, but I could easily believe that only 10-20% of the bundle contents come from the contract definition, while the rest comes from common libraries.

The problem is that the resulting resource consumption is the product of three multiplicands: the size of the bundle (which includes both the unique top-level code and all the supporting libraries), the number of times it appears during the installation/instantiation of a contract (i.e. showing up as arguments in several messages, through several vats), and the number of times it gets installed and/or instantiated.
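For concreteness, here is a minimal sketch of the current flow described above, written as a stand-alone deploy script. The package names and the `E(zoe).install(bundle)` call reflect the usual Agoric APIs, but the exact paths and signatures here are illustrative rather than authoritative:

```js
// deploy.js -- illustrative sketch of the current (pre-sharing) flow,
// in which every contract bundles its own copy of the Zoe/ERTP libraries.
import bundleSource from '@agoric/bundle-source';
import { E } from '@agoric/eventual-send';

export default async function deployContract(homePromise) {
  const { zoe } = await homePromise;

  // bundleSource walks the import graph of the starting file (the contract
  // code plus all of its Zoe/ERTP dependencies) and flattens everything
  // into one large serializable bundle object.
  const bundle = await bundleSource('./src/myContract.js');

  // The whole bundle travels through the deploy vat, the comms/vattp vats,
  // and a chain transaction before Zoe stores it; each instantiation then
  // copies it again into a newly-created dynamic vat.
  const installation = await E(zoe).install(bundle);
  return installation;
}
```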
The #46 blobstore work may reduce the number of times the data appears (by moving it out of messages and into some deeper kernel-managed data store), but it's still interesting to reduce the effective size of the bundle. The subtask of this ticket is to accomplish that by not storing multiple copies of data that is shared between multiple bundles, so we're only paying the cost of the unique components of each contract.
Simplifying the Solution
I love this topic because it overlaps with the Jetpack work I did many years ago. There are fascinating questions of early- vs late-linking, programmer intent expressed by `import` statements as petnames that are mapped through a community-fed (but locally-approved) registry table into hashes of code/behavior, auditing opportunities at various levels of the aggregation/build/delivery/evaluation/execution process, and dozens of other ideas that I've been itching to implement for a decade.

But, to get something done sometime in the foreseeable future, I should narrow the scope somewhat. I'm thinking of a solution with the following pieces (see the sketch after this list):
- Rather than a single monolithic bundle object (currently the `nestedEvaluate` module format), `bundleSource()` should emit an artifact that contains a graph of modules (packages, imports, exports) where all the actual code is identified by hash, and a big table mapping hash to the code string for that one module.
- ZCF does an `importBundle` with the bundle identifier. At this point, we need our Compartments and Endo to let us supply the module graph in pieces, pulled from the blobstore, rather than needing to touch every byte of every module through a syscall (to keep the large blob data out of the transcripts). Maybe a special `syscall.importEndoArchive(moduleGraphBlobcap)`. This doesn't need to go back to the kernel, but it should be able to fetch all the module source code from the blobstore, and evaluate it into a new graph of Compartments.
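To make the first piece concrete, here is a rough sketch of the kind of artifact `bundleSource()` might emit under this proposal. The field names (`graph`, `sources`, etc.) and the format label are invented for illustration, not a committed format:

```js
// Hypothetical shape of a hash-keyed bundleSource() artifact.
// All field names here are illustrative placeholders.
const bundleArtifact = {
  moduleFormat: 'hashedModuleGraph', // invented name, not a real format
  // The module graph: for each module, its import map and the hash of its
  // own source text. Only the graph is unique to this contract.
  graph: {
    './src/myContract.js': {
      imports: { '@agoric/ertp': 'sha512-aaaa...' },
      codeHash: 'sha512-ffff...',
    },
    // ...one entry per module reachable from the starting file...
  },
  // A big table mapping hash -> source string for exactly one module.
  // Shared libraries appear here once and can be deduplicated on-chain.
  sources: {
    'sha512-ffff...': '/* contract module source text */',
    'sha512-aaaa...': '/* ERTP module source text */',
  },
};
```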
Starting Points

We can achieve intermediate savings by just implementing portions of this plan. The most important piece is the Endo work, so that we can supply a module graph as a bunch of pieces, rather than only as a single monolithic bundle object. When ZCF does an `importBundle`, it supplies a bunch of objects (maybe a graph and a table of blobs). We still deliver this collection of blobs everywhere (no message-size savings), but Zoe can deduplicate them for storage, so Zoe doesn't consume extra RAM or secondary storage for the redundant copies of the libraries. That'd be the first win.

The second win could come if the deployment script could send a list of hashes to Zoe, and receive back a list of the hashes Zoe doesn't already know about. Then the deployment script could send just that subset of the component module sources; Zoe would hash them and store them indexed by their hash. Finally, the deployment script sends the module-graph piece (which references everything else by hash), instead of sending the redundant library data.
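A sketch of how that second-win negotiation might look from the deploy script's side. The Zoe methods used here (`whichBlobsAreMissing`, `storeBlobs`, `installHashedGraph`) do not exist; they are invented placeholders for whatever API we settle on:

```js
import { E } from '@agoric/eventual-send';

// Hypothetical hash-negotiation between a deploy script and Zoe, reusing
// the bundleArtifact shape sketched earlier. All Zoe methods below are
// invented placeholders.
async function installWithDedup(zoe, bundleArtifact) {
  const { graph, sources } = bundleArtifact;

  // 1. Ask Zoe which module hashes it has not seen before.
  const allHashes = Object.keys(sources);
  const missing = await E(zoe).whichBlobsAreMissing(allHashes);

  // 2. Send only the module sources Zoe is missing. Zoe re-hashes each
  //    blob itself and stores it under the hash it computed.
  const missingSources = Object.fromEntries(
    missing.map(hash => [hash, sources[hash]]),
  );
  await E(zoe).storeBlobs(missingSources);

  // 3. Send just the module graph, which refers to everything by hash.
  return E(zoe).installHashedGraph(graph);
}
```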
The third win will be to move this storage into the kernel, so Zoe can send blobcaps to ZCF instead of the full sources. We have to figure out the right API for mapping between hashes and blobcaps (it would be nice if userspace code didn't know about hashes, only opaque blobcaps). Once the prerequisites are in place, one approach would be for Zoe to send a big table of blobcaps to ZCF; ZCF uses syscalls to retrieve the bytes for each blobcap into RAM, reconstructs the module-contents table, then feeds the module graph and the contents table to an Endo-based `importBundle`. A second (better) approach would be for ZCF to give a single module-graph blobcap to `vatPowers.importBundle` or `syscall.importBundle`, and have something outside of userspace do the blobcap lookups to find all the components that Endo needs to load the contract module graph.
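To contrast the two approaches, here is a sketch from ZCF's point of view. The powers used here (`vatPowers.blobRetrieve`, `vatPowers.importBundle`) are hypothetical, echoing the names suggested above, and the argument shape passed to the Endo-based `importBundle` is illustrative:

```js
import { importBundle } from '@agoric/import-bundle';

// Approach A (hypothetical): ZCF pulls every blob into RAM itself.
async function loadContractA(vatPowers, moduleGraph, blobcapTable) {
  const sources = {};
  for (const [hash, blobcap] of Object.entries(blobcapTable)) {
    // Hypothetical syscall-backed power that reads a blob's bytes.
    sources[hash] = await vatPowers.blobRetrieve(blobcap);
  }
  // Reconstruct the module-contents table and hand the graph plus the
  // contents to an Endo-based importBundle (argument shape illustrative).
  return importBundle({ graph: moduleGraph, sources });
}

// Approach B (hypothetical, preferred): hand a single module-graph blobcap
// to something outside userspace, which performs all the blobcap lookups
// and feeds Endo without the sources ever passing through userspace.
async function loadContractB(vatPowers, moduleGraphBlobcap) {
  return vatPowers.importBundle(moduleGraphBlobcap);
}
```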
Security Considerations

When we share modules between separate contracts, we are of course only sharing "static module records" (basically the source code of each module). We do not share instances, so that e.g. a `Map` defined at the top level of some module cannot be used as a communication channel between unrelated contracts. We also want to prevent this sharing between multiple imports of the same module graph within a single vat/contract. We know that more sophisticated tools could make this sharing safe (and perhaps save us some RAM) by checking the module for the "DeepFrozen" property, but that involves a lot of static-analysis language work that we're not going to do in the near future.

To achieve savings, we'll be removing source code from the `bundle` and replacing it with references to source code that arrive via a different path. This must not enable someone to swap out source code. The use of hash-based identifiers should prevent this, but we must implement it properly: use a suitably secure hash function, and make sure nothing outside the hashed content can influence the resulting behavior. The API for adding blobs to the blobstore should accept `data`, not a `hash`, so there is no temptation for the blobstore to merely accept the word of the submitter (instead, the blobstore should compute its own hash, obviously, store the data under that computed hash, and then return the hash so the caller can confirm it matches their expectations).

For now, I think the deployer-submitted module graph should identify all modules by the hash of their contents. A later extension might replace this with e.g. a well-known library name and version identifier, or even a looser constraint like "the best version of `libfoo` that is compatible with API version 12". This would change the authority model: rather than the developer choosing exactly the source code to use, they would leave that choice up to something on the chain. On the plus side, this gives some later authority (perhaps managed by a governance vote) an opportunity to fix bugs and improve performance without the involvement of the original author. On the other hand, it enables interference by third parties, and expands the end-user's TCB to include those upgraders.
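Returning to the blobstore-API point above, here is a minimal sketch (assuming Node's `crypto` module; the names are illustrative) of a content-addressed store that computes the hash itself rather than trusting the submitter:

```js
import { createHash } from 'crypto';

// Illustrative content-addressed blobstore: callers supply data, never a
// hash; the store derives the key itself, so a submitter cannot register
// one hash for someone else's source code.
function makeBlobstore() {
  const blobs = new Map(); // hash string -> blob bytes/string

  return {
    add(data) {
      // Compute our own hash of the submitted bytes.
      const hash = `sha512-${createHash('sha512').update(data).digest('hex')}`;
      if (!blobs.has(hash)) {
        blobs.set(hash, data);
      }
      // Return the computed hash so the caller can confirm it matches
      // their own expectation.
      return hash;
    },
    get(hash) {
      return blobs.get(hash);
    },
  };
}
```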