High level design rules #119
(Chunking parameters in IPFS and other details towards "reproducible import" are also interesting to note as being in a very similar general train of thought as HAMT parameters -- though technically chunking parameters are not our problem here since it probably[*] wouldn't be encoded in an IPLD Schema. (Probably.))
This may not be possible for most ordered collections. Typically, the tree shape will differ depending on the insertion order, because the algorithms a tree uses to balance itself produce different trees from the same data when it is inserted in a different order. I would say the rule should be "When possible, the same content with the same Schema and codec should produce the same root CID. This may not always be possible, but when it is, implementations and specifications should make a best effort to ensure it." There may be some interesting research we could encourage into better balancing algorithms for ordered collections that produce the same shape from the same content regardless of insertion order, but most of the ones I'm familiar with in databases do not.
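(For illustration, a minimal sketch in plain Go -- no IPLD machinery -- of how insertion order alone changes the shape of a naive binary search tree, which is exactly the property that breaks shape-convergence for ordered collections:)

```go
package main

import "fmt"

// node is a naive (unbalanced) binary search tree node.
type node struct {
	key         int
	left, right *node
}

// insert places key into the tree without any rebalancing, so the final
// shape depends entirely on insertion order.
func insert(n *node, key int) *node {
	if n == nil {
		return &node{key: key}
	}
	if key < n.key {
		n.left = insert(n.left, key)
	} else {
		n.right = insert(n.right, key)
	}
	return n
}

// depth reports the height of the tree -- a cheap proxy for its shape.
func depth(n *node) int {
	if n == nil {
		return 0
	}
	l, r := depth(n.left), depth(n.right)
	if l > r {
		return l + 1
	}
	return r + 1
}

func main() {
	var a, b *node
	for _, k := range []int{1, 2, 3} { // ascending order: degenerates into a list
		a = insert(a, k)
	}
	for _, k := range []int{2, 1, 3} { // same keys, different order: balanced
		b = insert(b, k)
	}
	fmt.Println(depth(a), depth(b)) // 3 2 -- same content, different shapes
}
```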
Rather than parameterize this in the schema, we could just produce tools for graph mutation/creation that remain consistent. Newly created graphs can easily be consistent, and mutations should look to the existing graph for the codec/algorithm to use rather than relying on their own defaults.

Ideally, I'd like schemas to be agnostic of the codec and hashing algorithm. Instead, as a general rule, I'd say that "Schemas can be applied to anything conforming to the IPLD Data Model." If we don't do this, I think what you'll see happen as the ecosystem grows is people using tools built for specific codecs and algorithms, because it's trivial to build new tools for block creation and graph manipulation if you don't do the work of making them codec- and algorithm-agnostic. Hell, I was doing the exact same thing just a few months ago before we started building the new JS stack.
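(A rough sketch of that "look to the existing graph" rule, assuming the go-cid package: when replacing a node, derive the new link with the prefix -- version, codec, multihash -- of the link it replaces, rather than with a library default. The `relink` helper is hypothetical.)

```go
package main

import (
	"fmt"

	cid "github.com/ipfs/go-cid"
)

// relink produces a CID for updated bytes while preserving the CID
// version, codec, and multihash of the link being replaced, so a
// mutation converges with the conventions already used in the graph.
func relink(old cid.Cid, updated []byte) (cid.Cid, error) {
	return old.Prefix().Sum(updated)
}

func main() {
	// oldLink would normally come from the graph being mutated; here we
	// just fabricate one so the example runs.
	prefix := cid.Prefix{Version: 1, Codec: cid.DagCBOR, MhType: 0x12 /* sha2-256 */, MhLength: -1}
	oldLink, _ := prefix.Sum([]byte("original node bytes"))

	newLink, err := relink(oldLink, []byte("updated node bytes"))
	if err != nil {
		panic(err)
	}
	fmt.Println(oldLink, "->", newLink)
}
```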
I have a clarifying question: where do we draw the line between what should be in the schema vs. in the data itself? Example: in unixfsv2 we're going to need to record the chunking algorithm used to produce a file in order to ensure consistency when it is re-imported; where does that belong? Do we just convert the entire …
HAMTs can do this though, iiuc. That's why we like them so much. In general, yes, though. If our definition for advanced layouts lets us bring somewhat arbitrary amounts of logic into the picture, it'll definitely let us bring nondeterministic/nonconvergent behaviors in. That's something we'll have to spec as a potential rule-breaker. But we can at least also spec "given that all advanced layouts in your data follow The Rule, then The Rule holds overall", and that's pretty good.
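(A minimal sketch of why a HAMT can be insertion-order independent: the slot a key lands in is a pure function of the key's hash plus the fixed parameters, never of when the key arrived relative to other keys. Real HAMTs add bucketing and collision handling on top of this; the 8-bit fanout here is just for simplicity.)

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// slotAt returns the index of the slot a key occupies at a given depth of
// the trie, assuming a bitWidth of 8 (256-way fanout) purely for
// simplicity: each byte of the key's hash selects one level's slot.
func slotAt(key string, depth int) int {
	digest := sha256.Sum256([]byte(key))
	return int(digest[depth])
}

func main() {
	// The slot depends only on the key and the depth, never on insertion
	// order -- which is what gives the structure its convergence property.
	for _, key := range []string{"foo", "bar", "baz"} {
		fmt.Printf("%q -> level-0 slot %d, level-1 slot %d\n", key, slotAt(key, 0), slotAt(key, 1))
	}
}
```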
So this sounds nice on the tin. The thing is, this seems to fall apart real quick whenever I actually try to write an algorithm that does that. I already tried to hold that logic together once while working on a prototype. I guess there are examples where it works, but there are also some pretty basic ones where it doesn't, because there's just not enough information (or there's too much information, with no obvious nor cheap conflict-resolution choice). :/
Oh, lordie, yes. Inasmuch as these things are admitted to the discussion in schemas, they need to be parameters that are clearly confined to a variable in as small a corner as possible. Everything needs to work completely generically over them or we goofed terminally. Agree completely. We just can't not talk about them when we're talking about the convergence properties. (Unfortunately.)
┐_(ツ)_┌ I think that's one of the fundamental questions, yeah. I guess the answers are more questions...? I'm coming up with "Does it Help if it's in the schema (provide better errors, benefit from migration, asymptotically faster features, etc) vs left to application logic?" and "Is it relevant to the reproducible-import/convergence/whatever-we-call-it question?" and then divide those by "How terrified are we of library implementer burden scope creep if this is coreward?".
Yeah, and here I'm absolutely over my head. Dunno. What are your thoughts? It might be reasonable to make large byte slice chunking one of the things we hoist as an Advanced Data Layout. Bit complex. Also pretty extremely cool. Counterarguments could come, for example, from thinking about @mib-kd743naq's manyformat conjunctathon, which uses the unixfs chunking for one merkledag and then constructs several other merkledags reusing those same byte chunkings... there's just no earthly way to even get close to that kind of logic in pure schemas, so we have to be aware that that limit is there and we'll have to live with it.
Right, that’s why I specified “ordered collections.” I guess technically HAMTs have an order but it isn’t a useful ordering ;)
aahhhhh hahha. mmm. yes. Yeah, "ordered collection" meaning user-supplied order basically rockets things into Nope unless the user's own code is doing something convergent or CRDTish. True.
I think there is probably a way to do it, but I would expect that the cost of updates would be very large and that this cost would grow non-linearly with the size of the data structure. But hey, I won’t rule out the possibility that someone down the road finds a way to do it better than I can currently imagine.
I have only a general note. This discussion has a smell of "let's put everything into schemas". I would say parameters like e.g. chunk size should be put somewhere, but not necessarily into the schema. If it turns out that they should be there, they can be added later on.
@vmx agreed, I’m just still searching for the right language to describe that line. I think we sort of understand intuitively that if you give a program a schema it should not return you a graph with a mutated schema, and so many application configuration parameters need to be in the data itself rather than in the schema. For instance, when a unixfsv2 implementation encodes a file it should also write out the parameters and algorithm it used to create the chunks, and that should be in the data rather than the schema.
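(For concreteness, a rough sketch of what recording those parameters in the data rather than the schema could look like -- the field names here are hypothetical illustrations, not taken from unixfsv2:)

```go
package main

import "fmt"

// ChunkingParams records how a file's bytes were split into blocks, so a
// re-import can reproduce the same chunk boundaries (and thus the same
// CIDs). Field names are illustrative, not the unixfsv2 spec.
type ChunkingParams struct {
	Algorithm string // e.g. "fixed-size" or "rabin"
	ChunkSize int    // target chunk size in bytes, for fixed-size chunking
}

// File sketches a file node that carries its chunking parameters in the
// data itself, alongside the links to its chunks.
type File struct {
	Name     string
	Chunking ChunkingParams
	Chunks   []string // stand-ins for CID links to the chunk blocks
}

func main() {
	f := File{
		Name:     "example.bin",
		Chunking: ChunkingParams{Algorithm: "fixed-size", ChunkSize: 262144},
		Chunks:   []string{"<cid of chunk 0>", "<cid of chunk 1>"},
	}
	fmt.Printf("%+v\n", f)
}
```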
I think we should define a set of useful “tests” that can answer where things belong in IPLD.
Within the Schema space, we can probably find similar tests that can tell us where things should go in Schemas.
Let’s say this is the foundations document now.
High-level design rules -- rules that might not be encoded in the specs and the schemas themselves, but are unifying themes of the design -- can be useful: they help us here and now in making choices as we iterate our designs, and they continue to be useful in the long run, because such a rule means simplicity and ease of understanding in the final resulting system.
I'm thinking about trying to draft a couple such rules about how the Data Model, the Schemas, and the other increasingly fancy parts of the IPLD system (especially advanced data layouts and collections) can be related.
Suppose we have some tree of JSON data.
One rule we already have in the system is that if we treat that JSON tree as IPLD, then we can derive a CID for it, and the CID should be the same as long as the data is the same.
(Except it's kinda not. There's some parameters at the top that we didn't account for yet. But we'll get to this in the next section.)
Now what rules do we have if we take that JSON tree and we want to represent it in IPLD, but split it into several blocks of data, using CIDs as links to connect them?
Do we still have that same high-level rule about getting one CID from the rootmost object? We do not. The rootmost CID will depend on all the other parameters of linking: on the multicodec, on the multihash, and indeed even on the CID version itself used for all the intermediate nodes.
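(To make that concrete, a small sketch assuming the go-cid package: the same block bytes yield different CIDs the moment the codec or multihash choice changes, and since every parent embeds its children's CIDs in its own bytes, the divergence propagates all the way to the root.)

```go
package main

import (
	"fmt"

	cid "github.com/ipfs/go-cid"
)

func main() {
	// The same serialized node bytes in every case.
	block := []byte(`{"hello": "world"}`)

	// Prefixes that differ only in codec, or only in multihash function.
	// (The CID version is a third such knob, not varied here.)
	prefixes := []cid.Prefix{
		{Version: 1, Codec: cid.DagCBOR, MhType: 0x12 /* sha2-256 */, MhLength: -1},
		{Version: 1, Codec: cid.Raw, MhType: 0x12, MhLength: -1},
		{Version: 1, Codec: cid.DagCBOR, MhType: 0x13 /* sha2-512 */, MhLength: -1},
	}

	for _, p := range prefixes {
		c, err := p.Sum(block)
		if err != nil {
			panic(err)
		}
		// Identical bytes, different CIDs: any node linking to this block
		// embeds one of these CIDs in its own bytes, so the choice bubbles
		// up into the parent's CID, and so on to the root.
		fmt.Println(c)
	}
}
```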
Do Schemas help? (Maybe.)
"The same content per the Data Model plus the same Schema must yield same root CID".
That's a rule we could have.
It's a rule we don't currently have. We're aiming pretty close to it, but "close" and "rule" don't mix.
Multicodec and multihash continue to make this nontrivial: changes in choice of those parameters in linking, anywhere in the structure, bubble up.
Advanced layouts (e.g. HAMTs and their ilk) may also make this somewhat complicated. We're already pretty sure we're going to put references to advanced layouts into schemas, so that's a start, but the discussion of what to do with further parameters (e.g. there are, what, at least three numerical parameters to a HAMT which can be tuned fairly arbitrarily?) is still open.
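(For reference, a sketch of the kind of knobs in question for a HAMT; the names loosely follow the draft HashMap spec and should be treated as illustrative rather than normative:)

```go
package main

import "fmt"

// hamtConfig sketches the tunable parameters of a HAMT-style map. The
// names loosely follow the IPLD HashMap draft (hashAlg / bitWidth /
// bucketSize); treat them as illustrative.
type hamtConfig struct {
	HashAlg    string // which hash function keys are hashed with
	BitWidth   int    // bits of the hash consumed per level (fanout = 2^BitWidth)
	BucketSize int    // entries held inline in a leaf before it splits
}

func main() {
	a := hamtConfig{HashAlg: "sha2-256", BitWidth: 8, BucketSize: 3}
	b := hamtConfig{HashAlg: "sha2-256", BitWidth: 5, BucketSize: 3}
	// The same keys and values stored under these two configs produce
	// different node shapes, hence different intermediate and root CIDs.
	fmt.Printf("%+v\n%+v\n", a, b)
}
```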
What can we do to mitigate or improve this? Or are there other high level rules which would be similar to this but easier to conform to? (Or do we give up?)
Do we think it might be reasonable to specify the multicodec and multihash expected in a CID link when there's a link described in a schema?
Do we think it might be reasonable to specify any algorithmic parameters to advanced layouts in the schema, or in the serial form, or both?
Trickier: where are those two questions above still not enough?
In both of those situations, if we do specify those parameters in the schema, what do we feel is wise to do when reading data that doesn't match, but is still interpretable (e.g. if it does contain its own tags/parameters, as multicodec & multihash do, and as advanced layouts could choose to)? Should we reject things that don't match precisely, for consistency? Or read them anyway and, if writing them, do so with the schema's params? Or read them anyway, and write them back with their original params? Is that last option even consistently defined in all cases (e.g. what if we're adding data to a list of links)? The reading-vs-writing asymmetry raises a lot of interesting questions.
These are not purely Socratic questions: I don't have answers here.
"Yes; and both" are my first instincts as answers to the concrete questions in the last section, but they're not the only valid answers, and even if they were, it still leaves a lot of open room to define what to do in the case of mismatch and how to handle read-vs-write asymmetry (or perhaps more generally, asymmetry in how to handle things with a schema vs without, since advanced data layouts also need to straddle that (or explicitly choose not to)).