# lazy schema upgrade / data-migration for virtual objects #7407
cc @FUDCo
If we implement this, we can close #7337, which is now about allowing "compatible" changes to …
A few issues came up as I started to implement this:
I need to decide early whether to continue to serialize each property separately, or to switch now to marshalling all of `state` as a single record.
> why wouldn't that be a big deal?
I believe that's what the initial discussions called for.
> Why not do that right away?
That doesn't seem backwards compatible. How can it differentiate an old-format record from a new-format one?
Basically, first: why would anyone bother to put a constant vref in a `stateShape`? Second, if someone did do that, they were already pinning the object for the whole incarnation, so obviously they aren't worried about GCing it anytime soon. If it's such a long-lived object, then it seems unlikely that a mere upgrade is going to reduce that lifetime by much.
Eh… I guess it's only a moderate hassle to implement for new versions, but at this point we have no fast way to reconstruct the …
Because we don't need to? Because …
It doesn't need to be backwards compatible: only new versions of liveslots will be parsing this data. We only need the old vatstore values to be forward compatible: new versions of liveslots need to correctly handle old-format values. Something like:

```js
const loadDurableKindDescriptor = kindID => {
  const key = `vom.dkind.${kindID}.descriptor`;
  const raw = syscall.vatstoreGet(key);
  raw || Fail`unknown kind ID ${kindID}`;
  const parsed = JSON.parse(raw);
  if (!parsed.versions) {
    // old-format descriptor: fold its single stateShapeCapData into a
    // new `versions` table as version 0
    parsed.versions = { '0': { stateShape: parsed.stateShapeCapData } };
    delete parsed.stateShapeCapData;
  }
  saveDurableKindDescriptor(parsed);
  return parsed;
};
```

## Compression

Rereading my opening comment, at the time I said "I think we'll deploy compression and alternate versions at the same time". I no longer think that: @erights's experiments with compression showed a serious performance hit (time spent compressing), so I don't think we should implement compression unless we think the space savings is worth the slowdown, or if we can find ways to mitigate the slowdown. I think I like the idea of using …
There are several things about compression here that I need to clarify and/or correct. Could we fit a discussion of compression into the upcoming SwingSet kernel meeting?
Our conclusions from today's kernel meeting:
And @FUDCo had a great idea: most instances don't have any slots, so for those we could store just the marshalled body string, and distinguish the two cases on read:

```js
const data = typeof raw === 'string' ? { body: raw, slots: [] } : raw;
```
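The write side of that trick might look like this (a sketch; `saveStateRecord` and its key argument are hypothetical names):

```js
// store just the body string when there are no slots, and the full capdata
// record otherwise; the read path above distinguishes the two by typeof
const saveStateRecord = (key, value) => {
  const capdata = marshaller.serialize(value);
  const raw = capdata.slots.length === 0 ? capdata.body : capdata;
  syscall.vatstoreSet(key, JSON.stringify(raw));
};
```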
I found some additional ways to reduce the size of the vatstore data, which also makes the compression question easier.

## Current Encoding

Currently, we have the per-property `capdatas` encoding described in the top comment: each property of `state` is marshalled into its own capdata record. Then later, when someone changes one property through a `state` setter, only that property needs to be re-serialized. We don't call `marshal.serialize()` on the unmodified properties.

Benefits of this approach:
Drawbacks:
A further disadvantage of the current scheme is that whatever pattern-based compression we do would be limited to one property at a time. As an example, if … With …

## Proposed Encoding

Instead of storing a separate capdata record for each property, we'd marshal the entire `state` record as a single value.
The encoding will use newlines to separate two or three fields:
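A sketch of one plausible layout (the exact field order and the space-separated slots encoding here are guesses for illustration, not a confirmed format):

```js
// version first, then the marshalled body, then an optional third field
// carrying space-separated slots (omitted when the record has none)
const encodeStateRecord = (version, capdata) => {
  const fields = [version, capdata.body];
  if (capdata.slots.length > 0) {
    fields.push(capdata.slots.join(' '));
  }
  return fields.join('\n');
};
```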
(and if …)

Without compression, this shrinks the vatstore data to 37 bytes: … With compression, it shrinks down to just 21 bytes: … If the value contained a Remotable, the encoded vatstore value would look like …

Benefits:
Drawbacks:
We really need to serialize at least the one modified property during the setter, to signal errors properly (immediately). We could imagine serializing only the modified property then, and waiting until end-of-crank to serialize the whole record: this would remove some of the drawbacks, but it raises the question of whether we update refcounts in the setter or at end-of-crank, and we'd still have duplicate serialization of the modified properties. The wasteful serialization is more significant if we have a large number of properties.

## Conclusions

I think the space savings is worth the wasted marshalling calls. I need to run some experiments with mainnet data, but I think our VOs have fairly simple state.

## Compression

The pattern-based compression system that @erights built has an API that's roughly like:

```js
const holes = compress(stateShape, value);
const capdata = marshaller.serialize(holes);
```

Where `holes` holds just the parts of `value` that aren't pinned down by the `stateShape` pattern. This means we don't strictly need to record a "compressed or not" flag in the KindDescriptor's …

I can't think of any good reason to allow some records (VOs) of a given version to be compressed, while allowing others to not be compressed. So I think the …
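Presumably the read path inverts those two calls; a sketch, assuming a `decompress` that mirrors `compress`:

```js
// unserialize first, then re-inflate the record by filling the pattern's
// holes back in
const holes = marshaller.unserialize(capdata);
const value = decompress(stateShape, holes);
```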
I would like to avoid any design decision that assumes …
I would really like us to solve this problem more pervasively, rather than with ad-hoc solutions everywhere this problem crops up. I believe that not serializing CapData when we encode it would be one solution, but there may be others.
I hear you, and I can imagine us making improvements in message transport, but in this saved-state case, I don't see how we can traverse the path of types from "arbitrary tree of objects" (i.e. …). What did you have in mind?

We could make vatstore accept object graphs/trees, but that just hides the necessary serialization down a layer. We could change … For things like …
The approach detailed in endojs/endo#1478 is to change … That way you can nest this CapData in any other serializable structure, and only perform a single … Someday we'll be able to use the …
Oh, ok, so I think you're changing the … Let's see: that new type is a superset of the old "string" type, however the actual stored data would not be backwards compatible. (Remember that we use the vatstore for both marshalled data, in virtual-object `state`, and …)

I don't know how to achieve backwards compatibility at that layer. The kernel (or whoever is sitting closest to the DB, implementing the vatstoreGet syscall) always gets a string from the DB.
As I was rereading this (swapping this stuff back into my head in preparation for resuming talking about acting on some of this), it struck me that we should provide a default …
Thinking further about the parallels with SQL databases, consider the ALTER TABLE statement (see, e.g., https://www.sqlite.org/lang_altertable.html) and the limitations on what it can do. A lot of the rules have to do with things like PRIMARY KEY or UNIQUE constraints, which don't really affect us. I'm wondering if, with a little bit of cleverness, we can achieve most of our schema-migration objectives with statically declared data (in particular, by adding default values to the shape definition) rather than by requiring the user to provide migration code.

Consider that developers out in the world are going to be accustomed to the kinds of constraints that SQL imposes. In particular, while it does allow a column to be renamed (which I don't think is really a thing we need to support), it does not allow a column's type to be changed. Rather, in the wild what people do is add a new column for the new type and have the code that uses the data check for the old vs. new column at the point the data is used. In effect, the developer does write migration code for this one kind of case (which, in my experience, is actually quite rare) but does so at the point of use rather than in some separate place. One could argue that this is better because it locates all the things that need to understand the column's meaning together in one place. It also means that you aren't having to keep track of which changes are in which versions, or having to worry about explicit version numbers and when to increment them.
A data encoding of the schema-migration strategy has a further advantage: we know that the time of its application is unobservable, so we don't have to be either completely lazy or completely eager to stay deterministic. This is much like the advantage we gain when we can use a pattern as an acceptance predicate, rather than expressing the predicate in code.
I agree a data-only migration is better, and likely compatible with the expectation of JS developers that they can "feature test" their state. That said, I think we should internally implement this data-only migration on top of a schema-version model, so that we remain compatible with a user-provided migration function if the need arises.
Would the existing patterns provide sufficient input to perform this data-only migration? Is there always a default value that can be deduced from the new fields added to the pattern?
Not quite. But with a pattern and a copyRecord of default fields:

```js
const makeNewRecord = (oldRecord, newPattern, newDefaults) => {
  // old values win over defaults: defaults only fill in missing fields
  const newRecord = harden({ ...newDefaults, ...oldRecord });
  mustMatch(newRecord, newPattern);
  return newRecord;
};
```
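For example (a sketch, with a hypothetical new `frozen` field; `M`, `harden`, and `mustMatch` as usual from the endo environment):

```js
// a version-0 record, which predates the newly-added `frozen` field
const oldRecord = harden({ balance: 100 });
const newPattern = harden({ balance: M.number(), frozen: M.boolean() });
const newDefaults = harden({ frozen: false });

const upgraded = makeNewRecord(oldRecord, newPattern, newDefaults);
// upgraded is { balance: 100, frozen: false }, and matches newPattern
```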
In the top comment (#7407 (comment)):

> …
and in #7407 (comment):

> …
We'll need to retain the serialized `stateShape` for each old version. In a scenario where all instances hold e.g. the same Brand, the version-0 instances will each have their own refcount for the Brand, as will the recorded …

I'm thinking it may be useful to have counters for versions 1 and beyond, even though we won't have a (cheap) counter for version 0. It would let us delete the version-2 stateShape, some day. We could also schedule an (expensive) search for all remaining version-0 instances and forcibly upgrade them, maybe as a special …

And I'm thinking that the counters should include …

The cost of maintaining the counters would be an extra vatstore write for every delivery which adds, deletes, or upgrades a durable object. We might store the counters in the descriptor record (which we have to read anyways), to save a read, at the cost of the deserialization taking slightly longer. Or we might put them in a separate key.
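A sketch of the separate-key variant of that bookkeeping (the key name and helper are hypothetical):

```js
// per-version instance counters for one Kind, stored under their own key
const countersKey = `vom.dkind.${kindID}.versionCounts`;
const bumpVersionCounts = (oldVersion, newVersion) => {
  const counts = JSON.parse(syscall.vatstoreGet(countersKey) || '{}');
  if (oldVersion !== undefined) {
    counts[oldVersion] = (counts[oldVersion] || 0) - 1; // instance left oldVersion
  }
  if (newVersion !== undefined) {
    counts[newVersion] = (counts[newVersion] || 0) + 1; // instance now at newVersion
  }
  syscall.vatstoreSet(countersKey, JSON.stringify(counts));
};
```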
As an alternative or intermediate step until we implement version migration, we could instead allow partial state-shape upgrades for a subset of shape changes: existing fields must retain identical shapes (including exact identity of any remotable reference), and new fields are allowed only if they're optional (i.e. they allow undefined values). In that case the new shape would be strictly compatible with the old state data, and can thus replace the state shape in the kind descriptor. We would need to add refcounts for any new remotables referenced by the state shape (knowing that slot numbers might have changed).

This requires a predicate to assert the "compatibility" of the state shape as described above (see the sketch below), and changes to the state field accessors to handle possibly missing data in existing state records for these new fields.
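A sketch of that predicate, assuming shapes are plain copyRecords of per-field patterns; the `patternEQ` parameter is hypothetical (e.g. comparing the patterns' marshalled capdata for exact equality, slots included):

```js
import { matches } from '@endo/patterns';
import { Fail } from '@endo/errors';

const assertCompatibleShapeUpgrade = (oldShape, newShape, patternEQ) => {
  for (const [field, oldPattern] of Object.entries(oldShape)) {
    field in newShape || Fail`existing field ${field} must be retained`;
    // existing fields must keep exactly the same pattern, including the
    // identity of any remotable references embedded in it
    patternEQ(newShape[field], oldPattern) ||
      Fail`existing field ${field} must retain an identical shape`;
  }
  for (const [field, newPattern] of Object.entries(newShape)) {
    if (!(field in oldShape)) {
      // new fields must tolerate undefined, so old records still match
      matches(undefined, newPattern) ||
        Fail`new field ${field} must allow undefined`;
    }
  }
};
```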
Note that #10200 is about the smaller step of just adding new fields, and this ticket (#7407) will remain for the larger task of arbitrary schema/data/shape upgrades (as well as enabling compression, and changing the vatstore representation to accommodate both backwards-compatibility and maybe performance). |
One thing not really captured here is what to do about instances that didn't use any state shape. Technically each of these instances could have different fields defined, and the behavior methods are not currently allowed to define new fields. The most JS thing to do would be to allow new fields to be defined dynamically, but that conflicts with the state object being sealed. One possibility would be to have a "meta operation" on the state (possibly exposed as a symbol-named property) to define a new field and return a new state object (or we could make the …
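A sketch of what such a meta operation could look like (the symbol name and return convention are hypothetical):

```js
// a symbol-named meta operation cannot collide with ordinary
// (string-named) state fields
const DefineField = Symbol.for('vatData.defineField');

// inside a behavior method: returns a new, sealed state object that
// includes the extra field
const newState = state[DefineField]('frozen', false);
// newState.frozen === false
```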
We could also make the setter of …
Like the …
Interesting. The main problem I see with that approach is that the caller needs to make sure they discard the object they assigned to …
Yeah, good point. I agree.
from #7338 (comment)
(Note that this whole migration thing only applies to durable objects: merely-virtual objects do not outlive their incarnation, and are not upgraded. This writeup uses "virtual objects" as a generic term; my apologies for the lack of precision.)
The durable-Kind upgrade process gives authors a way to change the behavior of their virtual objects, but it does not yet have a way to change the shape of the data records during an upgrade.
@erights and @mathieu sketched out a scheme for this, and I wanted to capture it with enough detail for us to implement in the post-Vaults timeframe.
## Current Implementation

In trunk today (ca. 0e49c36), each durable Kind has a `descriptor` vatstore key (`vom.dkind.${kindID}.descriptor`, once #7369/#7397 lands). This is a JSON-serialized record of `{ kindID, tag, stateShape }`, and some information about facets.

The state of each virtual object is stored in `vom.${baseref}` as a JSON-serialized record of property capdata, named `capdatas`. This `capdatas` record has one property (with a matching name) for every property of the initial `state` data, as produced by the Kind's initializer function. The values of this record are capdata records (`{ body, slots }`). This means each property is marshalled separately. It also means the values are double-encoded before getting stored in the vatstore: `marshal.serialize()` does one JSON layer, and then the capdatas record is serialized with another layer. It also means that every vatstore value starts with an `{` open curly brace, since the value is always a JSON-encoded object.
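As a sketch of that double encoding (illustrative only; the real save path lives inside the VOM):

```js
// each property is marshalled separately (first JSON layer, inside each
// capdata body), then the whole record is JSON-serialized again (second layer)
const capdatas = {};
for (const [prop, value] of Object.entries(state)) {
  capdatas[prop] = marshaller.serialize(value); // { body, slots }
}
syscall.vatstoreSet(`vom.${baseref}`, JSON.stringify(capdatas));
```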
## Record Versions

We'll introduce a new format for the per-object state record: an array of `[version, capdatas]`. When loading the state of an object, we'll ``JSON.parse(syscall.vatstoreGet(`vom.${baseref}`))``, and then check whether the result is an Array (and extract the version integer) or an Object (which has an implicit version of `0`). New data will always be stored as an Array, but the code will be able to handle records stored by earlier versions that used an Object.

These versions will be used as keys into the "stateShapes table". This table is stored durably in the descriptor as `descriptor.stateShapes`, as an array of `[version, stateShapeCapData]` pairs, and then held in RAM in a Map keyed by the version. The capdatas record is compressed according to the matching `stateShape`, so we need to know which version was used in order to deserialize correctly.

I think we'll deploy compression and alternate versions at the same time. I think we can say that any record whose capdatas is an Object is not compressed, and any record whose capdatas is an Array is compressed.
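The load path would then look something like this (a sketch; `stateShapesByVersion` is the in-RAM Map mentioned above, and `Array.isArray` is used because `typeof` reports `'object'` for both forms):

```js
const raw = JSON.parse(syscall.vatstoreGet(`vom.${baseref}`));
let version;
let capdatas;
if (Array.isArray(raw)) {
  [version, capdatas] = raw; // new format, compressed
} else {
  version = 0; // old format, uncompressed: implicit version 0
  capdatas = raw;
}
const stateShapeCapData = stateShapesByVersion.get(version);
```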
## Upgrade Functions

We'll change the signature of `defineDurableKind` to add two new options: `currentVersion` and `migrateData` (names TBD of course). `currentVersion` declares that all data records created during this incarnation will be marked with this particular version (in addition to being constrained to the current `stateShape`). It defaults to `0`, which matches the records created by the original API.

When a method on an old durable object is invoked, and we need to build a `state` accessor object to pass to the behavior function, the VOM will load the state capdatas, deserialize it, and compare the record version against `currentVersion`. If they do not match, the deserialized `oldData` is passed to the migration function, which is expected to perform some type-specific upgrade/migration algorithm and return a `newData` object. This `newData` must match the current `stateShape`, and will be immediately serialized and written back to the vatstore (in a new record, with the updated `currentVersion`, so future retrievals that use the same version will not perform a migration).

If `migrateData` throws, or fails to meet the current `stateShape` constraint, how do we signal the error? We're almost certainly about to invoke a method on the target durable object. The error will probably be thrown from `provideContext()`, in case that helps. With luck, we can arrange for the method invocation to fail, and maybe the caller will notice the error usefully.

The `migrateData` function is obligated to tolerate any previous value of `currentVersion`, including `0` (for records created before this API was introduced), unless the author has some reason to believe that all old records have been migrated. In the future we might introduce counters to help authors discover how many records remain of each old version, or (more likely) offline tools to scan the vatstore DB to generate these counts (since they aren't very useful to the vat itself).
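A usage sketch under this proposal (option names are TBD as noted, the `migrateData` signature of `(oldVersion, oldData)` is an assumption, and `kindHandle`, `initState`, and `behavior` are presumed defined elsewhere):

```js
import { defineDurableKind } from '@agoric/vat-data';
import { M } from '@endo/patterns';
import { Fail } from '@endo/errors';

// version 1 adds a `label` field to a Kind whose version 0 stored only `count`
const makeCounter = defineDurableKind(kindHandle, initState, behavior, {
  stateShape: harden({ count: M.number(), label: M.string() }),
  currentVersion: 1,
  migrateData: (oldVersion, oldData) => {
    oldVersion === 0 || Fail`unexpected record version ${oldVersion}`;
    return harden({ ...oldData, label: 'unlabeled' }); // supply the new field
  },
});
```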
## currentVersion vs stateShape changes

Userspace is likely to go through multiple upgrades without changing the shape or the interpretation of the data record. Many upgrades will modify only non-Kind code, or will modify only one Kind and leave the rest alone. And when a Kind is modified, it might only be the behavior functions that are changed, and they will continue to use the same data records and `stateShape` constraint.

@erights raised the idea that we should track changes in `stateShape` to deduce transition points between schemas. A series of upgrades which use the same `stateShape` would all be given the same schema version.

I think it would be better to have userspace give us a distinct `currentVersion` value, instead of tracking `stateShape`s. Imagine one version which stores `{ balance: M.number() }` and means "number of tokens, as a float (JavaScript `Number`)". Then, the authors realize that fixed-precision is better for balances, and 9 decimal places is sufficient, and upgrade to a second version which stores `{ balance: M.number() }`, and means "integer number of nano-tokens, as a Nat". If we needed `stateShape` to change, they would also need to change the `balance` property name to `nanoBalance` or something, to artificially trigger the stateShape change sensor.

And, we need to provide an `oldVersion` to the `migrateData` function, which means we need to store a version in each data record, and the easiest way to let userspace correlate the value we store with the value `migrateData` gets is to have userspace provide the value in the first place, in the form of `currentVersion`. And if userspace is already providing a `currentVersion` that must change for the sake of their future `migrateData` function, then the VOM should be free to act upon `oldVersion !== currentVersion` as well.