feat: merkle references #8
Few notes based on the conversation with @mikeal yesterday
@mikeal provided more valuable feedback that I'd like to capture here before I'm able to incorporate it. Specifically, he raised a concern about encoding type semantics just through a byte prefix (tag), because it is plausible that someone may have byte array Chances of such a collision can be greatly reduced if, instead of byte prefixing, we prefix with a much more distinct byte sequence, e.g. instead of prefixing integers with Alternatively, we could still use the tags we have here but in addition prefix each with the name of this spec. I think this is a good suggestion and worth incorporating here.
Another interesting insight comes from @mikeal's work on https://mikeal.github.io/cryptography/public-data.html, where he uses double hashing, allowing you to decide whether you want to reveal the contents of the link or not. If we were to adopt it here, we would have the ability to decide whether we want to reveal what the data is (in which case we share the first-order hash) or conceal it by sharing the second-order hash. Reminds me a lot of onion routing.
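One way to read the double-hashing idea, as a minimal sketch only: SHA-256 from Python's `hashlib` stands in for whatever hash function would actually be chosen, and the names `first_order`, `second_order` and `matches` are hypothetical, not taken from @mikeal's write-up.

```python
import hashlib


def first_order(data: bytes) -> bytes:
    """Hash of the content itself (assumption: SHA-256 as a stand-in)."""
    return hashlib.sha256(data).digest()


def second_order(data: bytes) -> bytes:
    """Hash of the first-order hash; can be shared without sharing the first-order hash."""
    return hashlib.sha256(first_order(data)).digest()


def matches(first: bytes, second: bytes) -> bool:
    """Check that a revealed first-order hash corresponds to a known second-order hash."""
    return hashlib.sha256(first).digest() == second
```

Someone who was only given the second-order hash can later be handed the first-order hash and confirm with `matches` that the two belong together.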
After reading @expede's feedback I realized I overemphasized the data locality point, which this proposal does not really address (parts of the data might still be elsewhere). What I meant to emphasize is that partition choices affecting content identifiers is problematic. We could instead have a way to derive an identifier for any data, giving us the freedom to choose how to partition on read as opposed to on write. I tried to adjust the text accordingly.
#### Bytes

Bytes is represented as a [merkle fold] of pair of leaf nodes. Left node is the cryptographic hash of the UTF-8 encoded string `merkle-structure:bytes/raw` and describes binary encoding format of the right node. Right node is a byte array of the bytes.
🤔 I'm starting to wonder if the right node should be an actual bao tree here
I'm convinced now that it should be; that way the hash of the raw bytes would come out the same as the blake3 hash of that content.
Blake3 splits the byte array into 1024-byte chunks (the last chunk may be shorter). If there is only one chunk, that chunk is the root.
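The chunking step described above can be sketched as follows. This is only an illustration of the splitting rule, not BLAKE3 itself; the helper name `split_chunks` is hypothetical, and treating empty input as a single empty chunk is an assumption of this sketch.

```python
CHUNK_SIZE = 1024  # chunk size in bytes, as stated in the comment above


def split_chunks(data: bytes) -> list[bytes]:
    """Split data into CHUNK_SIZE-byte chunks; the last chunk may be shorter."""
    if not data:
        # Assumption of this sketch: empty input is one empty chunk.
        return [b""]
    return [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
```

For input of at most 1024 bytes the result has exactly one chunk, which is the case where that chunk is also the root.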
I dug deeper into blake3 & bao and sadly compatibility is probably not going to work out. The problem is that `leaf_hash(chunk, n)` and `node_hash(left, right, is_root)` are not defined in terms of the blake3 hash, so we would either have to define the same algorithm in the merkle-folding or instead define them using the hashing function (as per the spec today).
Seems to me that we could take advantage of all the benefits that blake3/bao offers with a generic hashing function, which seems like a better tradeoff here.
That said, it is probably worth evaluating why the chunk index `n` is used and whether it would make sense to incorporate it, and the same for the `is_root` parameter.
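To make the second option concrete, here is a minimal sketch of `leaf_hash` and `node_hash` defined on top of a generic hash function. It is not compatible with BLAKE3 or bao: SHA-256 is a stand-in, the prefix bytes and the 8-byte index encoding are assumptions, and `left_len` and `tree_hash` are hypothetical helper names. The tree shape assumed here puts the largest power of two of chunks (strictly smaller than the total) in the left subtree.

```python
import hashlib


def leaf_hash(chunk: bytes, n: int) -> bytes:
    """Hash a chunk together with its index n (prefix 0x00 marks a leaf)."""
    return hashlib.sha256(b"\x00" + n.to_bytes(8, "little") + chunk).digest()


def node_hash(left: bytes, right: bytes, is_root: bool) -> bytes:
    """Hash two child hashes (prefix 0x01 marks a parent); is_root is mixed in."""
    flag = b"\x01" if is_root else b"\x00"
    return hashlib.sha256(b"\x01" + flag + left + right).digest()


def left_len(n_chunks: int) -> int:
    """Largest power of two strictly less than n_chunks (requires n_chunks >= 2)."""
    return 1 << ((n_chunks - 1).bit_length() - 1)


def tree_hash(chunks: list[bytes], offset: int = 0, is_root: bool = True) -> bytes:
    """Fold a non-empty list of chunks into a single hash."""
    if len(chunks) == 1:
        # A single chunk is its own (sub)tree; in this sketch leaves carry no root flag.
        return leaf_hash(chunks[0], offset)
    split = left_len(len(chunks))
    left = tree_hash(chunks[:split], offset, False)
    right = tree_hash(chunks[split:], offset + split, False)
    return node_hash(left, right, is_root)
```

In this sketch the index `n` makes identical chunks at different positions hash differently, and `is_root` makes the hash of a subtree differ from the hash of the same chunks treated as a whole input. Whether those properties are what the spec needs is exactly the open question raised above.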
For reference, the bao hash and the blake3 hash come out the same,
although that is because bao leverages blake3 internals to compute the leaf hash
https://github.com/oconnor663/bao/blob/e5f01f8c5d11f653eac861ee1a72f0fae5728460/src/encode.rs#L559-L562
https://github.com/BLAKE3-team/BLAKE3/blob/8fc36186b84385d36d8339606e4d1ea6ff471965/src/guts.rs#L38-L46
https://github.com/BLAKE3-team/BLAKE3/blob/8fc36186b84385d36d8339606e4d1ea6ff471965/src/lib.rs#L526-L537
https://github.com/BLAKE3-team/BLAKE3/blob/8fc36186b84385d36d8339606e4d1ea6ff471965/src/lib.rs#L408-L426
And to compute the node hash
https://github.com/oconnor663/bao/blob/e5f01f8c5d11f653eac861ee1a72f0fae5728460/src/encode.rs#L297-L300
https://github.com/BLAKE3-team/BLAKE3/blob/8fc36186b84385d36d8339606e4d1ea6ff471965/src/guts.rs#L48-L67
https://github.com/BLAKE3-team/BLAKE3/blob/8fc36186b84385d36d8339606e4d1ea6ff471965/src/lib.rs#L909-L927
https://github.com/BLAKE3-team/BLAKE3/blob/8fc36186b84385d36d8339606e4d1ea6ff471965/src/lib.rs#L408-L426
only(leaf) | ||
`````` | ||
Merkle fold of zero leaves is an empty hash (hash of zero bytes)
Merkle fold of zero leaves is an empty hash (hash of zero bytes)
Merkle fold of zero leaves is a leaf that is byte array with 0 bytes.
I think this might be a better choice. The consequence is that the empty list becomes the double hash of the list operator. Also, the null abstract data format can be defined as the null tag and an empty tree.
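A small sketch of why the empty list would come out as a double hash under the suggested rule. The assumptions are loud ones: SHA-256 stands in for the hash function, a pair is combined as `hash(left + right)`, a single leaf folds to itself, and the tag string `merkle-structure:list` as well as the helper names `h`, `fold` and `list_reference` are hypothetical.

```python
import hashlib


def h(data: bytes) -> bytes:
    """Stand-in hash function (assumption: SHA-256)."""
    return hashlib.sha256(data).digest()


def fold(leaves: list[bytes]) -> bytes:
    """Merkle fold over leaves, using the suggested rule for zero leaves."""
    if not leaves:
        return b""        # suggested rule: zero leaves fold to a 0-byte leaf
    if len(leaves) == 1:
        return leaves[0]  # a single leaf folds to itself
    split = 1 << ((len(leaves) - 1).bit_length() - 1)
    return h(fold(leaves[:split]) + fold(leaves[split:]))


LIST_TAG = "merkle-structure:list"  # hypothetical tag name


def list_reference(items: list[bytes]) -> bytes:
    """Fold the pair (hash of the tag, fold of the items)."""
    return fold([h(LIST_TAG.encode("utf-8")), fold(items)])
```

With these assumptions, `list_reference([])` hashes the tag hash concatenated with zero bytes, which is the hash of the tag hash, i.e. the double hash of the list operator.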
This is a random set of reactions.
First, the encoding problem is already implicitly recognized and handled in a very primitive way in a lot of existing code, by indexing data by multihash -- that way I can request whatever encoding I want and, as long as the multihash matches, the peer can serve it back to me. This doesn't solve the CBOR->JSON translation issue (since that will fail on decode) but does let you switch back and forth with raw bytes. Just an interesting observational note.
Second the partitioning problem is real too -- and I think it has parallels in the design of JSON APIs. I think back to the way people tried to build Hypermedia JSON APIs for years, encoding sub resource links alongside data, on the theory it would enable smart clients to consume data how they wanted. But ultimately fixed partitioning was inefficient, and hard for backend providers to evolve over time. GraphQL ate their lunch by allowing clients to specify exactly how much data they needed dereferenced. (btw, there's probably a world where you could save bandwidth by intentionally referencing everything but the scalar fields you care about, or paginating long list or map elements)
Re: Null -- the billion dollar mistake in Java is not the idea of a null type -- it's making all types implicitly able to be the type OR null. This type of issue is really only relevant, IMHO, at the schema layer of IPLD, if you want to go there (i.e. when you're dealing with statically typed data, as opposed to the basic data model of IPLD, which is more dynamic).
Ok, so thinking through @expede's UCAN scenario, I would imagine the signature field would be derived by signing the root hash of the payload reference tree with the user's private key. That wouldn't change if you replaced parts of the proof chain with reference tree hashes.
BTW, there's no reference to how one determines whether a node is a leaf or a reference tree -- is that encoded in the hash in a way similar to BAO?
Also, do we imagine a reader being able to request data in arbitrary formats and at arbitrary levels of depth of de-referencing? Asking because that's a non-trivial amount of work to ask of the person holding the data. You'd essentially have to define a GraphQL-like protocol to describe requests from the client.
I will say while this design appears to solve many problems, it's also a sufficient mind bender and compatibility break that it's a bit frightening to think about. In the UCAN example above, the signature would not match either the current signature you'd get from a UCAN with a full delegation chain OR the signature you get with a UCAN with references.
One way to consider this moving forward is to look at it at an application level, rather than at a universal data format level. The application a lot of people want for IPFS is "Store my files", which doesn't need all this. (Maybe directories are a candidate, but honestly that could benefit from simplicity too.) On the other hand, if I wanted to use IPLD as a distributed content-addressed graph database to power, say, a social application, wow, this seems amazing.
In that context, it might make sense to define this as a kind of encoding format of its own. Personally, I don't think IPLD is useful when not working with structured data in an application context, rather than a file context. (UnixFS, while technically using IPLD, is a super weird format that only supports one kind of serialization and one set schema for blocks -- I wish we simply thought of it as a separate thing.)
BTW, I like turning all bytes blocks into a Bao Tree cause it creates a nice symmetry in the format.
Thanks for taking a look @hannahhoward, I'll try to provide some notes inline or requests for clarification.
I would summarize this as: it is solved at the block level, but not at the DAG level. That is, if you have links from that block, you're still very limited in the kinds of things you can do.
I think the implicit proposition here is that partitioning is often a read-time decision and not a write-time one.
I did take feedback from @expede and @ribasushi on this and have added
Correct if I understand you correctly. In other words
All data types are mapped to a pair of the tag (format) and the corresponding node. The tag tells you whether the right node is a leaf or a subtree. It's a bit tricky to compare to blake3 (which is what I think you meant), because blake3 just chunks a byte array into leaves and builds up a tree. Here we map data into a list of nodes; primitives end up as nodes with depth 1 and recursive types can be of arbitrary depth. However, each node of the list gets merkle folded into effectively a leaf node, from which we build a BAO tree. It is a bit confusing because you can look at each node of the list as a subtree or a leaf. That said, you can tell whether what you have is a partial DAG or a complete one because you can differentiate primitives from references.
Short answer is NO, but the longer answer is that it really depends on a case-by-case basis. I think of this as a way that enables a provider to choose the level of indexing they want to support. It is really up to them and their customers to decide what kind of granularity they want to support. That said, the provider could always reduce granularity on demand, but they cannot support more granular requests than they have chosen to index by. From the W3Up perspective, we find that as a provider we are not in a good position to decide what to index and what to announce on the network. It really is domain specific, and we are thinking of ways to enable clients to choose what to index to support the corresponding reads. What is proposed here is similar in spirit: I imagine a user uploading 4GiB of CBOR data and then publishing a verifiable index for parts within it.
That is accurate. That said, UCANs are moving in the direction of removing proof chains from delegations anyway, so it may not be as bad. A different way to view this proposal might be to treat it as a canonical addressing scheme; you could still have a mapping from the canonical one to the one being used. If you do so, you suddenly become able to transcode things on the fly if requests are in the canonical addressing scheme.
I agree. This is really taking what Iroh does for files and applying it to structured data. Although there is a thought that maybe the two can converge to a degree; specifically, I have considered mapping primitive string and bytes types into BAO, in which case bytes representing a file would result in the exact same hash as blake3, assuming blake3 hashing is used.
Agreed. I think UnixFS suffers from the same issues outlined here, that is, partitioning and layout are decided on write (well, more like on ingestion) and constrained in size by incremental variability. I think blake3 / bao makes far better tradeoffs here. Honestly, just a switch in mindset to UnixFS as a particular view over a file might get us really far. In that case you can start comparing the tradeoffs offered by that view vs blake3/bao.
👍 |
TODO