
!!!

This document has moved.

You'll now find information like this in the ipld/ipld meta-repo, and published to the web at https://ipld.io/ .

All documentation, fixtures, specifications, and web content are now gathered into that repo. Please update your links, and direct new contributions there.

!!!


This document was archived from #118.

#118: Schema <> Block interaction + "advanced layouts logic"

Opened 2019-05-02T03:21:53Z by rvagg

Here’s my (likely faulty and in need of intervention) mental model for the primary way of using schemas: I’ve been imagining that an IPLD block could optionally have a predictable property at its root, _schema or $schema, that has association information in it. Maybe the schema is in there or there’s a CID pointing to a schema. The latter is nice because you can use it to associate with known types.

```
{
  "_schema": { "ref": CID, "type": "FancyCollectionRootBlock" }
  ...
}
```

Where:

  • CID -> the schema text itself
  • FancyCollectionRootBlock -> the Type in the schema that this block will map to

or embed the schema text in the block itself with:

"_schema": {
"def": "... schema text ...",
"type": "NicelyShapedObjectFromUglyBlockLayout"
}

A loader encounters that and can load the schema from the CID or just read it from the inline definition, and create an appropriately instantiated form of the block according to the schema.

So myNicelyShapedObject = Loader.Load(someCID) would: read the block, find a _schema and know how to turn the raw block layout into NicelyShapedObjectFromUglyBlockLayout form. Maybe it’s a struct from a tuple representation or one of the ugly string concatenation representations but now it’s a beautiful native object/butterfly.
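
A minimal TypeScript sketch of that Load flow might look like the following; every name here (Loader, fetchBlock, parseSchema, instantiate) is a hypothetical stand-in, not from any shipped IPLD library:

```typescript
// Hypothetical sketch: a loader that inspects a block's "_schema" hint
// and instantiates the block as the named schema type.
type CID = string;
type Schema = { types: Record<string, unknown> };

interface SchemaHint {
  ref?: CID;     // link to a schema document...
  def?: string;  // ...or the schema inlined in the block
  type: string;  // entry-point type within that schema
}

interface RawBlock {
  _schema?: SchemaHint;
  [key: string]: unknown;
}

class Loader {
  constructor(
    private fetchBlock: (cid: CID) => Promise<RawBlock>,
    private parseSchema: (source: unknown) => Schema,
    private instantiate: (schema: Schema, type: string, raw: RawBlock) => unknown
  ) {}

  async load(cid: CID): Promise<unknown> {
    const raw = await this.fetchBlock(cid);
    if (!raw._schema) return raw; // no hint: hand back the plain data-model form

    // Follow the "ref" link, or use the inline "def" definition.
    const source = raw._schema.ref !== undefined
      ? await this.fetchBlock(raw._schema.ref)
      : raw._schema.def;
    const schema = this.parseSchema(source);

    // Map the raw layout (tuple representation, string concatenation, ...)
    // onto the named type -- the block-to-butterfly step.
    return this.instantiate(schema, raw._schema.type, raw);
  }
}
```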

Or, for the case of objects which have associated logic, like a multi-block collection that is useless to the user in its decoded+instantiated form, we have a way of “registering” logic with the loader. Maybe the loaders we ship (whatever that is, js-ipld-stack, go-ipld-prime, or some layer above) have some of our standard types already registered. A certain CID might point to a schema of our standard HAMT, so we register the logic for that HAMT into the loader so when it encounters that CID it knows it can not only decode+instantiate a nicely shaped block but it can hand it off to that logic for further processing or pass that logic back to the user for their own handling.

So multiblockMap = Loader.Init(someCID) would: read the block, find a _schema, see that it has a CID that’s been pre-registered with some bit of code—let’s call it “advanced layout logic”. Decode and instantiate the block in its schema-defined form, pass it off to the “advanced layout logic” and give that back to the user. Maybe they get an object that they can do multiblockMap.Get(Key) on and it’ll do some cascading interaction between the loader and the “advanced layout logic” to traverse the tree in an appropriate manner, decoding+instantiating child nodes and figuring out how to traverse further down to Get what the user wants.
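
Continuing the same hypothetical sketch, the CID-registration and Init() side could look roughly like this (again, nothing here is a real API):

```typescript
// Hypothetical continuation of the Loader sketch above.
interface AdvancedLayout {
  get(key: string): Promise<unknown>; // may load further blocks internally
}

type LayoutFactory = (rootNode: unknown, loader: Loader) => AdvancedLayout;

const layoutRegistry = new Map<CID, LayoutFactory>();

function registerLayout(schemaCid: CID, factory: LayoutFactory): void {
  layoutRegistry.set(schemaCid, factory);
}

// Init: decode + instantiate the root block in its schema-defined form,
// then hand it to the pre-registered "advanced layout logic". In the
// proposal the schema CID is discovered inside the block's own _schema
// field; it's passed explicitly here only to keep the sketch short.
async function init(loader: Loader, rootCid: CID, schemaCid: CID): Promise<AdvancedLayout> {
  const rootNode = await loader.load(rootCid);
  const factory = layoutRegistry.get(schemaCid);
  if (!factory) throw new Error(`no logic registered for schema ${schemaCid}`);
  return factory(rootNode, loader);
}

// const multiblockMap = await init(loader, someCid, hamtSchemaCid);
// const value = await multiblockMap.get("someKey");
```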

Maybe the “advanced layout logic” has to have some additional functionality associated with it to allow for interaction with selectors/paths? So you can traverse right into the block-spanning-data-structure with a selector, relying on the “advanced layout logic” that the loader knows about.

This CID-registration thing would allow for the loading of custom things that span one or more blocks; a user might make their own FancyCollection that we don’t need to care about, register the CID for its schema with the loader, Init() a root block and get back their collection. It would also allow versioning of the data types. Maybe we build a HAMT this year with a single elementMap and decide next year that we really should have gone with separate CHAMP dataMap and nodeMap elements, so the root and child node blocks will be different and the logic to traverse will be different. So we just register those two different schema CIDs with two different sets of logic and users will get back a Map with the same API but it loads and traverses blocks differently under the hood.
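
With the hypothetical registry above, that versioning story is just two registrations against two schema CIDs (all names made up):

```typescript
// Hypothetical: two schema CIDs mapped to two traversal implementations
// that present the same Map-like API.
declare const hamtV1SchemaCid: CID;
declare const champV2SchemaCid: CID;

declare class HamtV1 {
  constructor(root: unknown, loader: Loader);
  get(key: string): Promise<unknown>;
}
declare class ChampV2 {
  constructor(root: unknown, loader: Loader);
  get(key: string): Promise<unknown>;
}

registerLayout(hamtV1SchemaCid, (root, loader) => new HamtV1(root, loader));
registerLayout(champV2SchemaCid, (root, loader) => new ChampV2(root, loader));
```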

I know I at least have holes in this thinking relating to how IPLD is currently used today; I’m focused on the “get me a collection that I can interact with programmatically” case but maybe there’s a lot more (selector traversal?) and maybe that’s not even the right case to be thinking about.


(2019-05-02T03:33:01Z) mikeal:

I don’t see why we would not want to contain all the schema information in the schema instead of as metadata in the reference to the schema, so definitely:

```
{ _schema: CID() }
```

I’d also like to avoid describing the block boundaries and instead just state that the _schema property is the schema, and it may be a link or it could even be inline. In general, we should just avoid these distinctions and make sure our tools for reading things never assume a particular block structure and only a specific graph shape.

I don’t see why we would embed the DSL text for the schema rather than the parsed schema in JSON types. Then we could inline it or link to it encoded into any codec that supports the data model. This would also allow the tools to more easily embed versioning information or other parser requirements.
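
For illustration only (the real layout would be fixed by the schema spec, not here), the parsed form of a small DSL declaration could be represented as plain data-model values like:

```typescript
// Illustrative only -- not the actual schema-schema layout.
// DSL input:  type Foo struct { baz String }
const parsedSchema = {
  schema: {
    Foo: {
      kind: "struct",
      fields: {
        baz: { type: "String" },
      },
    },
  },
};
```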


(2019-05-02T05:41:35Z) rvagg:

> I don’t see why we would not want to contain all the schema information in the schema instead of as metadata in the reference to the schema

Well, that's mainly because many schemas will have to be constructed of multiple, inter-dependent types, so a loader will have to know which type it's loading in the current block.

```ipldsch
type FooType struct {
  bar BarType
  baz String
  yoiks Link<FooType>
}

type BarType struct { String : Int }
```

We'd need a rule like: the first type in a schema describes the current block, the rest are dependents. Unfortunately that would mean you can't use different parts of a schema for different purposes. You couldn't have one block that's a FooType with a _schema that points to this schema, and another block that's a BarType with a _schema that points to the same schema used for a different purpose. That was the kind of flexibility I was imagining. Forcing a special case of the first type would also discourage mammoth general-purpose schemas (is that a good or a bad thing?).
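
Concretely, the flexibility being imagined is two blocks naming the same schema but different entry-point types (hypothetical sketch, placeholder CID):

```typescript
// Hypothetical: two blocks pointing at the same schema document but
// naming different entry-point types -- exactly the flexibility that a
// "first type describes the block" rule would rule out.
const schemaCid = "bafy..."; // placeholder CID for the FooType/BarType schema

const blockA = { _schema: { ref: schemaCid, type: "FooType" } /* , ...fields */ };
const blockB = { _schema: { ref: schemaCid, type: "BarType" } /* , ...fields */ };
```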

> I’d also like to avoid describing the block boundaries and instead just state that the _schema property is the schema, and it may be a link or it could even be inline. In general, we should just avoid these distinctions and make sure our tools for reading things never assume a particular block structure and only a specific graph shape.

OK, I think I just had a <click> moment. You keep talking about wanting to blur the block boundaries, so you're mainly talking about the tools making the boundaries transparent so the way we talk about them above the blocks doesn't have to involve things like "could be a CID or inline" because that's implied.

> I don’t see why we would embed the DSL text for the schema rather than the parsed schema in JSON types.

Fair. Although that means we have to make sure we're paying as much attention to the parsed form (maybe we can call that the AST) as the DSL syntax. I'm trying to do that here, but it should also make it into this repo's schema spec.


(2019-05-02T09:52:34Z) warpfork:

Re: schema <> block interaction:

There's a lot of different ways we can do this. I'm not sure any way is clearly more correct than any other, and it might often be correct to leave this choice to the user.

I did some writing on schema versioning and migration pathways which might be relevant: https://github.com/ipld/go-ipld-prime/blob/cd9283ddd86af15b9bc4a1f5b71f00fbfc2f8b94/doc/schema.md#schemas-and-migration

In particular there's even a section on "Strongly linked schemas".

Having explicit pointers to schemas is something we can do. I dunno about "should", though. It smells to me a lot like nominative typing, and I tend to feel we get better migrateability over time by aiming more towards the vicinity of structural typing. (Not that nominative typing isn't great in programming languages; but data description is a different ballgame...)

(EDIT: the next section may have taken "native type/butterfly" slightly too literally; if you meant "schema type", your text makes more sense to me now, and this next paragraph is still true but slightly off-topic.)

The big issue with explicitly linked schemas in the data itself is that it doesn't generally give you a way to translate into native types. Codegen (or our reflecty 'bind' stuff in go-ipld-prime) is something you get to do once at compile time. (Okay, I'm wearing very not-javascript^Wlisp-tinted glasses here -- suppose we have languages where eval at runtime is generally not a thing. :)) So, folks wanting to have native types matching their schema will generally run codegen once for one specific schema version... and then need to figure out how to map data that's from any other schema version into that form. Explicitly linked schema versions don't exactly block that, but it turns out they don't help, either.


Parameters to advanced layout logic being directly linked from the data is a different matter than the above rant about explicit schema links being problematic, though. (Concretely, supposing the HAMT code is already written, plugging params into that at runtime is fine, contrasted with plugging other schema content into codegen at runtime, which is... not.)

(No comment on the rest of the advanced layout logic part yet, need more time to parse that, so I'll make it a separate comment. I think I'm mostly nodding, though.)


(2019-05-02T13:40:56Z) warpfork:

+1 to always linking the IPLD form of schemas rather than the DSL. I guess this isn't committed to any documentation yet, but I also had some comments about the longevity of one vs the other over in ipld/go-ipld-prime#10 (comment) .


> Forcing a special case of the first type would also discourage mammoth general-purpose schemas (is that a good or a bad thing?).

FWIW, I tend to imagine using a bunch of small schemas in many programs, yeah.

I think this probably also works better and is easier to imagine scaling up ecosystemically when going with the structural-typing instead of nominative-typing take. One just doesn't get most of the same issues when composing things with the structural-typing point of view -- the schema versions just aren't there to have the potential to conflict.


Lots of nodding about the general concept of registering logic for handling some FancyCollection.

An interesting devil in the details is... should there be different treatment of the algorithm (e.g. "HAMT" -- it's one thing; that's easy enough to register) vs more parameters the algorithm might have (e.g. bucket width -- an int, thus countably infinite range, and not so clear how to register)?
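
Reusing the hypothetical registerLayout sketch from earlier in this document, one possible split is to register the algorithm once and let the parameters travel with the data:

```typescript
// Made-up names throughout: the algorithm ("HAMT") registers as one unit,
// while its open-ended parameters are read out of the root block at runtime.
interface HamtParams {
  bitWidth: number;    // e.g. 8
  bucketSize: number;  // e.g. 3 -- an int with countably infinite range
}

declare const hamtSchemaCid: CID;
declare class Hamt {
  constructor(root: unknown, params: HamtParams, loader: Loader);
  get(key: string): Promise<unknown>;
}

registerLayout(hamtSchemaCid, (root, loader) => {
  const params = (root as { params: HamtParams }).params; // params travel with the data
  return new Hamt(root, params, loader);
});
```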


There's a brief mention of potential relationship to selectors in the first post, and... I'd say, yeah, nah, I wouldn't try to worry about this.

We can use selectors in various forms of interaction with advanced data layouts, sure: either by using them with a schema that points out the advLayout, thus allowing a single path step to translate into the entire internal operation of the advLayout; or by using them with a schema that doesn't point out the advLayout, and points out all the internal types of the advLayout's guts instead, in which case it can get some fraction of the internal nodes, etc. That toggle is kinda cool.

But things like HAMT sharding instantly get into more ✨ logic ✨ than we'll probably ever be capable of expressing in selectors (short of going full wasm), so... IMO, we should lean into that freely, and ignore selectors entirely while developing advanced layouts. If it so happens that there's an advanced layout which is amenable to certain kinds of selector traversal mapping onto things like "the first half of this tree", that's great; if not, well, fine.


(2019-05-02T13:42:05Z) warpfork:

I wrote up another issue with some broad thoughts about high level design rules I've had cooking in the back of my head for a while, particularly with an eye towards convergence properties in the hashes in all the stuff we generate, and that also gets into a bunch of this territory: #119


(2019-05-02T16:31:05Z) mikeal:

> Well, that's mainly because many schemas will have to be constructed of multiple, inter-dependent types, so a loader will have to know which type it's loading in the current block.

Ahhh, gotcha, ya I hadn’t realized yet that the schema definitions are quite literally just definitions and don’t designate a clear entry point to be applied. In that case, I’d just change def to defs 😊


(2019-05-02T16:34:56Z) mikeal:

> OK, I think I just had a <click> moment. You keep talking about wanting to blur the block boundaries, so you're mainly talking about the tools making the boundaries transparent so the way we talk about them above the blocks doesn't have to involve things like "could be a CID or inline" because that's implied.

Right, but even more than that, you could imagine a large schema definition with links throughout it (another reason I want to make sure we work with the parsed schema), pointing to previously encoded definitions. There are performance tradeoffs on either side of this: with links you get better deduplication and caching; with everything inlined you get better loading time when there is no cache state. Given that we don’t know all the performance tradeoffs an application might want to make, it’s best for us to just leave these open ended and think in terms of the data and its shape rather than where we expect things to link together.


(2019-05-03T01:57:12Z) rvagg:

> So, folks wanting to have native types matching their schema will generally run codegen once for one specific schema version... and then need to figure out how to map data that's from any other schema version into that form. Explicitly linked schema versions don't exactly block that, but it turns out they don't help, either.

I think this is getting to the divergence point between a JS implementation and a Go one (and friends of either down the road).

In JS it's easy to imagine being able to switch out versions of things, essentially providing future-proofing by allowing an endless set of versions of the same thing. Version 1 of a HAMT schema takes one path; encounter version 2 of that schema, divert and take another; iterate for the next 25 years till we have 10 slightly different versions of a HAMT, each "better" than the previous, but all within reach if our loader wants to instantiate an IPLD form of them. In fact we could ship them all in the same codebase.

With Go + codegen you're going to get into quite a mess if you need to cope with that kind of flexibility. I'm instinctively not keen on ever being able to say that this version of a collection we're shipping today is going to be the same type that you'll be using for the next 25 years! "Oh, did you show up with the codegen form of HAMT v2? Sorry, we're looking at a v1 now".

How do you see us addressing that kind of inflexibility in the future?


(2019-05-03T05:08:53Z) rvagg:

re that last comment about codegen, some relevant pieces from https://github.com/ipld/go-ipld-prime/blob/cd9283ddd86af15b9bc4a1f5b71f00fbfc2f8b94/doc/schema.md (which I hadn't seen till now, ooops!)

> Thus, a valid strategy for longlived application design is to handle each major change to a schema by copying/forking the current one; keeping it around for use as a recognizer for old versions of data; and writing a quick function that can flip data from the old schema format to the new one. When parsing data, try the newer schema first; if it's rejected, try the old one, and use the migration function as necessary.

> If you're using codegen based on the schema, note that you'll probably only need to use codegen for the most recent / most preferred version of the schema. (This is a good thing! We wouldn't want tons of generated code per version to start stacking up in our repos.) Parsing of data for other versions can be handled by ipldcbor.Node or other such implementations which are optimized for handling serial data; the migration function is a natural place to build the codegenerated native typed Nodes, and so each half of the process can easily use the Node implementation that is best suited.

I'll have to ponder this more, maybe that's good enough of an answer.
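
That quoted strategy is mechanical enough to sketch in TypeScript; the matchers and migration function below are hypothetical stand-ins for schema-driven validation:

```typescript
// Minimal sketch of the quoted strategy: try the newest schema first,
// fall back to the old one plus a migration function.
type NodeData = unknown;

declare function matchesSchemaV2(data: NodeData): boolean;
declare function matchesSchemaV1(data: NodeData): boolean;
declare function migrateV1toV2(data: NodeData): NodeData;

function parseWithMigration(data: NodeData): NodeData {
  if (matchesSchemaV2(data)) return data;                // preferred (codegen'd) form
  if (matchesSchemaV1(data)) return migrateV1toV2(data); // recognize old data, flip it forward
  throw new Error("data matches no known schema version");
}
```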


Still up for discussion is how to link to a schema: is it enough to just link to a CID or do you need to provide metadata to say what the entry point to the schema is? Is it enough to have the first type in the schema represent the parent entry point? My proposal for a _schema property would be as in OP, either "_schema": { "ref": CID, "type": "MyType" } or "_schema": { "defs": {...}, "type": "MyType" }. (aside: I'm not sure how best to represent that in the schema! I suppose it'd be a union of two structs, but what type would "defs" resolve to? Maybe a map of strings to ... any?)
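
In TypeScript terms (purely illustrative; the real answer would live in the schema DSL itself), that union might look like:

```typescript
// Hypothetical rendering of the proposed "_schema" property: a union of
// the linked form and the inline form. "defs" is left as a map of strings
// to arbitrary data-model values, matching the open question above.
type CID = string;

interface SchemaRef  { ref: CID; type: string }                       // link to a schema document
interface SchemaDefs { defs: Record<string, unknown>; type: string }  // inline definitions

type SchemaProperty = SchemaRef | SchemaDefs;
```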


Also still up for discussion is the interaction between schemas and the advanced layout logic, but it sounds like we all just have vague ideas about how that will work. So without further suggestions I'll just work through it with code, see what happens and come back with more concrete proposals.


(2019-05-05T20:11:50Z) warpfork:

"Oh, did you show up with the codegen form of HAMT v2? Sorry, we're looking at a v1 now".

I'm hoping in those cases, we'll have the codegen for golang emit a short snippet that's more or less a nicely-typed/nicely-autocompleting wrapper, and the wrapper will call out to library code which implements doing the HAMT-v$n logic in a very generic/weakly-typed way. In go-ipld-prime we already have very "generic" Node interfaces which should be a reasonable meeting place to make this happen. The library code for the advanced layout can stay then in the library, and we can choose to have multiple versions of it there. (Hopefully not too many versions, still, but "two" or other reasonable small numbers shouldn't be a stretch.)
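
The shape of that wrapper pattern, transposed into an illustrative TypeScript sketch (warpfork is describing Go codegen; every name here is made up):

```typescript
// Generic, weakly-typed library code: one copy per HAMT version, kept in
// the library rather than regenerated per schema.
type GenericNode = unknown;

declare class GenericHamtV1 {
  constructor(root: GenericNode);
  get(key: string): Promise<GenericNode>;
}

// Codegen output: only this thin, nicely-typed wrapper is per-schema; it
// delegates all the HAMT logic to the library class above.
type MyValue = { name: string };

class MyMap {
  private inner: GenericHamtV1;
  constructor(root: GenericNode) {
    this.inner = new GenericHamtV1(root);
  }
  async get(key: string): Promise<MyValue> {
    return (await this.inner.get(key)) as MyValue;
  }
}
```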

is it enough to just link to a CID or do you need to provide metadata to say what the entry point to the schema is?

Yeah, I like your schemaCID+entrypointType pairing idea. 👍


(2019-05-06T10:34:10Z) vmx:

> <@mikeal> I don’t see why we would embed the DSL text for the schema rather than the parsed schema in JSON types.

Do you mean the *.ipldsch.json representation or something else? If it's that representation, it can be stored within the IPLD Data Model. Which leads to…

> <@rvagg> is it enough to just link to a CID or do you need to provide metadata to say what the entry point to the schema is?

If the schema itself is just IPLD, we could also use CID + Path to point to a specific type of the schema.
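
For example (illustrative values only):

```typescript
// A { ref, type } pair vs a single CID + Path string naming one type
// inside a schema document; the "/types/" path layout is hypothetical.
const byRefAndType = { ref: "bafy...schemadoc", type: "FooType" };
const byPath = "bafy...schemadoc/types/FooType";
```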


(2019-05-06T11:34:31Z) warpfork:

> parsed schema *.ipldsch.json can be stored within the IPLD Data Model

👍

> If the schema itself is just IPLD, we could also use CID + Path to point to a specific type of the schema.

We could, but

A) CID + Path is something that -- though it's been frequently alluded to -- has yet to actually land in a spec that's fully finished. Or rather, perhaps the "Merklepath" spec gets at it, but that has some serious todos about character encoding, and also doesn't fully cover how that's supposed to work if you encounter one as a "link".

B) Does that add any value? I'm not sure it particularly does: any system looking at these references would have the semantics of the schema typesystem in mind already, so such an audience isn't really going to gain anything by having a more generalized system of reference.


(2019-05-06T11:50:59Z) vmx:

> B) Does that add any value?

Given that we do that "CID + Path" thing: uniformity ("you want something out of IPLD, use a path") and fewer concepts to learn.


(2019-05-06T22:16:55Z) mikeal:

Based on #3 (comment) and #83, we’ll eventually want a “Link Type” that is extensible the same way we are doing for collections.

I can see us wanting to add some features to the language for expressing simple paths in order to make building those types easier. We could piggy-back on that expressive syntax in order to define the CID + Path form you’re mentioning here.

We keep throwing around this term Path like there’s agreement and a set of conforming implementations, but there really isn’t. We have some trivial path traversers that operate only on the Data Model and a path traverser in IPFS that has a special case for the dag-pb HAMT, but we don’t yet have:

  • A definition for how paths operate against Schemas, so that we can define a path that would traverse through a HAMT or other collection in a generic way.
  • A traversal engine that supports such paths.
  • An expression for paths that tells us if the path is strictly against the Data Model or if it should operate against Schemas.