feat: merkle references #8

Open
wants to merge 8 commits into
base: main

@Gozala Gozala commented Jan 25, 2024

TODO

  • Update the spec so that byte arrays are chunked and merkle-folded, effectively making them compatible with blake3 (assuming blake2s hashing is used).
    • Even with a different hashing algorithm like sha256 you'd get most of the BAO benefits that Iroh has demonstrated using blake3.
  • Add some notes on performance.
    • Point out that this is a highly parallelizable algorithm.
    • Notes on caching: one could maintain a table mapping language-specific hashes to merkle references, avoiding the need to rehash things that were already hashed, maybe even across sessions.
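
The caching note above could look roughly like the following. This is a minimal sketch, not the spec's algorithm: `reference`, `_cache` and `HASH_CALLS` are hypothetical names, the way tuples are hashed is a toy stand-in for the real merkle fold, and sha256 is assumed as the hash.

```python
import hashlib

HASH_CALLS = 0  # counts real hash invocations so cache hits are visible


def h(data: bytes) -> bytes:
    global HASH_CALLS
    HASH_CALLS += 1
    return hashlib.sha256(data).digest()


# Cache keyed by the value itself: Python's dict finds entries through the
# built-in hash() plus equality, i.e. the language-specific hash.
_cache: dict = {}


def reference(value) -> bytes:
    """Toy reference: bytes are hashed directly, tuples hash their children's references."""
    if value in _cache:
        return _cache[value]
    if isinstance(value, bytes):
        ref = h(value)
    elif isinstance(value, tuple):
        ref = h(b"".join(reference(child) for child in value))
    else:
        raise TypeError(f"unsupported value: {type(value).__name__}")
    _cache[value] = ref
    return ref


doc = (b"alice", (b"bob", b"carol"), (b"bob", b"carol"))
first = reference(doc)
calls_after_first = HASH_CALLS  # the repeated subtree was hashed only once
second = reference(doc)         # served entirely from the cache
```

Python randomizes `hash()` for bytes per process, so a cache that should survive across sessions would need a different, stable key.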

Gozala commented Jan 25, 2024

A few notes based on the conversation with @mikeal yesterday:

  • @mikeal suggested not worrying about data types and basically only having bytes, leaving it up to the user to encode values into bytes.
    • 🤔 I think there is merit to the idea of layering things so that the base layer does not care about data types, only bytes and ordered lists of bytes. On top you could have a layer that knows about data types and provides a mapping to bytes / lists of bytes.
    • I am not, however, persuaded by the argument that there is no consensus on how to encode various types. In fact I'd argue that it is better to have a decision on how to encode data for hashing than to have the flexibility to choose an encoding. Encoding floats in double precision for hashing does not imply you should encode them that way on disk or on the wire. For hashing it makes sense to optimize the encoding for speed, while for disk / network it's likely better to optimize for size.
  • PKV double hashes keys. Initially that was my plan, but while writing this up I could no longer see what that would solve, so I no longer do that. I'm still able to provide proofs without revealing attribute names, so maybe the binary merkle tree gives us the property that we'd use double hashing for?
  • I like that I can join lists [a, b] and [c, d] just by looking at hashes, but I don't like that [[a, b], [c, d]] = [a, b, c, d]. Perhaps we could byte-prefix list addresses and retain the former without the latter.
  • I'm not sure that treating a structure as a list of attributes is a good idea. Perhaps we should differentiate the two. Just like ☝️ we could byte-prefix the hash. The tradeoff is that we conceal less information: if we don't prefix, you can't tell whether a reference is to a scalar or a complex structure, and you also can not join structs and lists. If you do prefix, then you can join structs and lists, but now you can tell whether a reference is to a struct / list or a scalar.
  • 🤔 I wonder if we should convert large strings / bytes into binary merkle trees also. If we do, we'll gain the ability to prove that some slice is within the structure, although I'm not sure it is worth it. It also sounds like something we already get from blake3 out of the box.
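
The list-joining point above can be illustrated with a small sketch. It assumes a simple pairwise binary merkle fold over sha256 and a hypothetical one-byte `LIST_TAG`; neither is taken from the spec text, they only show why an untagged fold makes `[[a, b], [c, d]]` and `[a, b, c, d]` collide and how a prefix separates them.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def fold(nodes):
    """Pairwise binary merkle fold (assumed layout; a lone node is its own root)."""
    level = list(nodes)
    if not level:
        return h(b"")
    while len(level) > 1:
        paired = [h(level[i] + level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            paired.append(level[-1])  # odd node is carried up unchanged
        level = paired
    return level[0]


a, b, c, d = (h(x) for x in (b"a", b"b", b"c", b"d"))

# Untagged: a list's address is just the fold of its items.
ab, cd = fold([a, b]), fold([c, d])
flat_untagged = fold([a, b, c, d])
nested_untagged = fold([ab, cd])  # same root as the flat list

LIST_TAG = b"\x01"  # hypothetical one-byte list prefix


def tagged_list(nodes):
    return h(LIST_TAG + fold(nodes))


flat_tagged = tagged_list([a, b, c, d])
nested_tagged = tagged_list([tagged_list([a, b]), tagged_list([c, d])])
# Joining still works, but from the untagged fold roots rather than the addresses.
joined_from_roots = h(LIST_TAG + fold([ab, cd]))
```

In this sketch the prefix removes the nested/flat collision, at the cost that joining needs the inner fold roots instead of the published list addresses.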

Gozala commented Jan 26, 2024

@mikeal provided more valuable feedback that I'd like to capture here before I'm able to incorporate it. Specifically, he raised a concern about encoding type semantics just through a byte prefix (tag), because it is plausible that someone may have the byte array [2, 193, 15], which when hashed with sha256 produces bbnelb523uc3jtdsqegf5odzm3qm3jsxrwg65e66arecvfkep4b2a, which is coincidentally the address for 1985 according to this spec.

The chances of such a collision can be greatly reduced if, instead of a single-byte prefix, we prefix with a much more distinctive byte sequence. E.g. instead of prefixing integers with 2 we could prefix them with the UTF-8 encoded bytes for LEB 128 formatted integer value. In fact we could reduce the probability even more if we take the LEB128 spec itself and use its sha256 as the prefix.

Alternatively we could still use the tags we have here but in addition prefix each with the name of this spec.

I think this is a good suggestion and worth incorporating here.
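
To make the collision concrete: the byte array [2, 193, 15] is exactly the one-byte tag 2 followed by the unsigned LEB128 encoding of 1985. The sketch below shows that, and how a longer prefix changes the preimage. The tag string `merkle-structure:integer/leb128` is a hypothetical name modelled on `merkle-structure:bytes/raw`, not something defined in this PR.

```python
import hashlib


def leb128(n: int) -> bytes:
    """Unsigned LEB128: 7 bits per byte, high bit set on all but the last byte."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


INT_TAG_SHORT = bytes([2])
# Hypothetical long tag: hash of a descriptive, spec-scoped name.
INT_TAG_LONG = hashlib.sha256(b"merkle-structure:integer/leb128").digest()

short_preimage = INT_TAG_SHORT + leb128(1985)
long_preimage = INT_TAG_LONG + leb128(1985)

raw_bytes = bytes([2, 193, 15])
collides_with_short_tag = (
    hashlib.sha256(raw_bytes).digest() == hashlib.sha256(short_preimage).digest()
)
```

With the long tag, a byte array only collides with the integer if it happens to start with those 32 specific bytes, which is far less likely than starting with a single 2.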

Gozala commented Jan 27, 2024

Another interesting insight comes from @mikeal's work at https://mikeal.github.io/cryptography/public-data.html, where he uses double hashing so that you can decide whether you want to reveal the contents of a link or not. If we were to adopt it here, we would be able to decide whether we want to reveal what the data is (in which case we share the first-order hash) or conceal it (by sharing the second-order hash).

Reminds me a lot of onion routing.
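
A minimal sketch of the double-hashing idea as described in this comment, assuming sha256 and a content-addressed store keyed by the first-order hash. The function names and the `store` dict are hypothetical, and the linked document may define the construction differently.

```python
import hashlib


def first_order(content: bytes) -> bytes:
    return hashlib.sha256(content).digest()


def second_order(content: bytes) -> bytes:
    return hashlib.sha256(first_order(content)).digest()


def matches(first: bytes, second: bytes) -> bool:
    """Anyone holding a first-order hash can check it against a second-order hash."""
    return hashlib.sha256(first).digest() == second


# Hypothetical content-addressed store, keyed by first-order hash.
store = {first_order(b"secret note"): b"secret note"}

concealed = second_order(b"secret note")  # safe to share: cannot be used to look up content
revealed = first_order(b"secret note")    # sharing this lets the recipient fetch the content
```

Sharing `concealed` commits to the data without enabling a lookup; later sharing `revealed` lets the recipient both fetch the content and confirm it matches the earlier commitment.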

rfc/merkle-structure.md Outdated Show resolved Hide resolved

Gozala commented Feb 4, 2024

After reading @expede's feedback I realized I overemphasized the data locality point, which this proposal does not really address (parts of the data might still be elsewhere). What I meant to emphasize is that partitioning choices affecting content identifiers is problematic. We could instead have a way to derive an identifier for any data, giving us the freedom to choose how to partition on read as opposed to on write.

I tried to adjust the text accordingly.

@Gozala changed the title from "feat: merkle structure" to "feat: merkle references" on Feb 4, 2024

#### Bytes

Bytes is represented as a [merkle fold] of pair of leaf nodes. Left node is the cryptographic hash of the UTF-8 encoded string `merkle-structure:bytes/raw` and describes binary encoding format of the right node. Right node is a byte array of the bytes.
Contributor Author

🤔 I'm starting to wonder if the right node should be an actual bao tree here

Contributor Author

I'm convinced now that it should be; that way the hash of the raw bytes would come out the same as the blake3 hash of that content.

Contributor Author

Blake3 splits the byte array into 1024-byte chunks (the last chunk may be shorter). If there is only one chunk, that chunk is the root.

Contributor Author

I dug deeper into blake3 & bao and sadly compatibility is probably not going to work out. The problem is that leaf_hash(chunk, n) and node_hash(left, right, is_root) are not defined in terms of the blake3 hash, so we would either have to define the same algorithm in the merkle folding or instead define them using the hashing function (as per the spec today).

It seems to me that we could get all the benefits that blake3/bao offers with a generic hashing function, which seems like a better tradeoff here.

That said, it is probably worth evaluating why the chunk index n is used and whether it would make sense to incorporate it, and the same for the is_root parameter.
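
As a rough illustration of the generic-hash alternative, here is a sketch that keeps the shape of leaf_hash(chunk, n) and node_hash(left, right, is_root) but builds them on sha256. The prefix bytes and the 8-byte little-endian index are assumptions made for this example, and the output is NOT a BLAKE3 or Bao digest. The split rule (left subtree gets the largest power-of-two number of chunks that still leaves one for the right) follows my understanding of the BLAKE3 tree layout.

```python
import hashlib

CHUNK_SIZE = 1024  # chunk size mentioned in the comment above


def leaf_hash(chunk: bytes, n: int) -> bytes:
    # Assumed domain separation: 0x00 prefix plus the chunk index.
    return hashlib.sha256(b"\x00" + n.to_bytes(8, "little") + chunk).digest()


def node_hash(left: bytes, right: bytes, is_root: bool) -> bytes:
    # Assumed domain separation: a different prefix for root and inner nodes.
    prefix = b"\x02" if is_root else b"\x01"
    return hashlib.sha256(prefix + left + right).digest()


def tree_hash(data: bytes) -> bytes:
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)] or [b""]

    def subtree(lo: int, hi: int, is_root: bool) -> bytes:
        count = hi - lo
        if count == 1:
            # A single chunk is its own root; a fuller design might flag that too.
            return leaf_hash(chunks[lo], lo)
        left_count = 1 << ((count - 1).bit_length() - 1)
        left = subtree(lo, lo + left_count, False)
        right = subtree(lo + left_count, hi, False)
        return node_hash(left, right, is_root)

    return subtree(0, len(chunks), True)


block = bytes(CHUNK_SIZE)
l0, l1 = leaf_hash(block, 0), leaf_hash(block, 1)
two_chunk_root = tree_hash(block * 2)
```

In this sketch the index `n` makes identical chunks at different positions hash differently, and `is_root` keeps the root of a two-chunk input distinct from the same pair appearing as an inner node of a larger tree.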

@Gozala Gozala Mar 6, 2024

```
only(leaf)
```

Merkle fold of zero leaves is an empty hash (hash of zero bytes)
Contributor Author

Suggested change:
  • Current: Merkle fold of zero leaves is an empty hash (hash of zero bytes)
  • Proposed: Merkle fold of zero leaves is a leaf that is byte array with 0 bytes.

I think this might be a better choice. The consequence is that the empty list becomes the double hash of the list operator. Also, the null abstract data format can be defined as the null tag and an empty tree.
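
The "double hash of the list operator" consequence can be checked with a small sketch. It assumes that the merkle fold of a pair is the hash of the two nodes concatenated, that sha256 is the hash, and that the list operator is named `merkle-structure:list` (a hypothetical name modelled on `merkle-structure:bytes/raw`).

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


LIST_OPERATOR = b"merkle-structure:list"  # hypothetical operator name


def pair(left: bytes, right: bytes) -> bytes:
    """Merkle fold of two nodes, assumed here to be the hash of their concatenation."""
    return h(left + right)


EMPTY_FOLD_CURRENT = h(b"")  # current text: fold of zero leaves is the hash of zero bytes
EMPTY_FOLD_SUGGESTED = b""   # suggested text: fold of zero leaves is a zero-length leaf

list_tag = h(LIST_OPERATOR)
empty_list_current = pair(list_tag, EMPTY_FOLD_CURRENT)
empty_list_suggested = pair(list_tag, EMPTY_FOLD_SUGGESTED)  # equals h(h(LIST_OPERATOR))
```

Under these assumptions the suggested wording makes the empty list address depend only on the operator name.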

hannahhoward (Member) left a comment

this is a random set of reactions.

First, the encoding problem is already implicitly recognized and handled in a very primitive way in a lot of existing code, by indexing data by multihash -- so that way I can request whatever encoding I want and as long as the multihash matches, the peer can serve it back to me. Which doesn't solve the CBOR->JSON translation issue (since that will fail on decode) but does let you switch back and forth with RAW bytes. Just an interesting observational note.

Second the partitioning problem is real too -- and I think it has parallels in the design of JSON APIs. I think back to the way people tried to build Hypermedia JSON APIs for years, encoding sub resource links alongside data, on the theory it would enable smart clients to consume data how they wanted. But ultimately fixed partitioning was inefficient, and hard for backend providers to evolve over time. GraphQL ate their lunch by allowing clients to specify exactly how much data they needed dereferenced. (btw, there's probably a world where you could save bandwidth by intentionally referencing everything but the scalar fields you care about, or paginating long list or map elements)

Re: Null -- the billion dollar mistake in Java is not the idea of a null type -- it's making all types implicitly able to be the type OR null. This type of issue really only is relevant IMHO at the schema layer of IPLD, if you want to go there (i.e. when you're dealing with statically typed data, as opposed to the basic data model of IPLD which is more dynamic)

Ok so thinking through @expede's UCAN scenario, I would imagine the signature field would be derived by signing the root hash of the payload reference tree with the user's private key. That wouldn't change if you replaced parts of the proof chain with reference tree hashes.

BTW, there's no reference to how one determines whether a node is a leaf or a reference tree -- is that encoded in the hash in a way similar to BAO?

Also, do we imagine a reader being able to request data in arbitrary formats and arbitrary levels of depth of de-referencing? Asking cause that's a non trivial amount of work to ask the person holding data to do. You'd essentially have to define a GraphQL like protocol to describe requests from the client.

I will say while this design appears to solve many problems, it's also a sufficient mind bender and compatibility break that it's a bit frightening to think about. In the UCAN example above, the signature would not match either the current signature you'd get from a UCAN with a full delegation chain OR the signature you get with a UCAN with references.

One way to consider this moving forward is to look at it at an application level, rather than a universal data format level. The application a lot of people want for IPFS is "Store my files" which doesn't need all this. (Maybe directories are a candidate but honestly that could benefit from simplicity too). On the other hand, if I wanted to use IPLD as a distributed content addressed graph database to power say a social application, wow this seems amazing.

In that context, it might make sense to define this as a kind of encoding format of its own. Personally, I don't think IPLD is useful when not working with structured data in an application context, rather than a file context. (UnixFS, while technically using IPLD, is a super weird format that only supports one kind of serialization and one set schema for blocks -- I wish we simply thought of it as a separate thing)

BTW, I like turning all bytes blocks into a Bao Tree cause it creates a nice symmetry in the format.

Gozala commented Feb 20, 2024

> this is a random set of reactions.

Thanks for taking a look @hannahhoward. I'll try to provide some notes inline or ask for clarification.

> First, the encoding problem is already implicitly recognized and handled in a very primitive way in a lot of existing code, by indexing data by multihash -- so that way I can request whatever encoding I want and as long as the multihash matches, the peer can serve it back to me. Which doesn't solve the CBOR->JSON translation issue (since that will fail on decode) but does let you switch back and forth with RAW bytes. Just an interesting observational note.

I would summarize this as: it is solved at the block level, but not at the DAG level. That is, if you have links from that block you're still very limited in the kinds of things you can do.

> Second the partitioning problem is real too -- and I think it has parallels in the design of JSON APIs. I think back to the way people tried to build Hypermedia JSON APIs for years, encoding sub resource links alongside data, on the theory it would enable smart clients to consume data how they wanted. But ultimately fixed partitioning was inefficient, and hard for backend providers to evolve over time. GraphQL ate their lunch by allowing clients to specify exactly how much data they needed dereferenced. (btw, there's probably a world where you could save bandwidth by intentionally referencing everything but the scalar fields you care about, or paginating long list or map elements)

I think the implicit proposition here is that partitioning is often a read-time decision and not a write-time one.

> Re: Null -- the billion dollar mistake in Java is not the idea of a null type -- it's making all types implicitly able to be the type OR null. This type of issue really only is relevant IMHO at the schema layer of IPLD, if you want to go there (i.e. when you're dealing with statically typed data, as opposed to the basic data model of IPLD which is more dynamic)

I did take feedback from @expede and @ribasushi on this and have added Null. I still think there is a better way to model data than introducing Null, but that is not the goal here, so Null it is.

> Ok so thinking through @expede's UCAN scenario, I would imagine the signature field would be derived by signing the root hash of the payload reference tree with the user's private key. That wouldn't change if you replaced parts of the proof chain with reference tree hashes.

Correct, if I understand you correctly. In other words hash([a, b, c]) will be the same as hash([a, b, hash(c)]). If you meant something else please clarify.
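
The property hash([a, b, c]) = hash([a, b, hash(c)]) can be sketched as follows. This is an illustration under assumptions, not the spec's algorithm: sha256 as the hash, a pairwise binary fold, a hypothetical `merkle-structure:list` tag next to the `merkle-structure:bytes/raw` tag from the spec text, and a hypothetical `Ref` wrapper marking a value that has been replaced by its reference.

```python
import hashlib
from dataclasses import dataclass


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


BYTES_TAG = h(b"merkle-structure:bytes/raw")
LIST_TAG = h(b"merkle-structure:list")  # hypothetical tag name


@dataclass(frozen=True)
class Ref:
    """A value that was swapped out for its already-computed reference."""
    digest: bytes


def fold(nodes):
    """Pairwise binary merkle fold (assumed layout; a lone node is its own root)."""
    level = list(nodes)
    if not level:
        return b""
    while len(level) > 1:
        paired = [h(level[i] + level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            paired.append(level[-1])
        level = paired
    return level[0]


def reference(value) -> bytes:
    if isinstance(value, Ref):
        return value.digest  # do not rehash: the reference stands in for the value
    if isinstance(value, bytes):
        return h(BYTES_TAG + value)
    if isinstance(value, list):
        return h(LIST_TAG + fold([reference(item) for item in value]))
    raise TypeError(f"unsupported value: {type(value).__name__}")


a, b, c = b"a", b"b", [b"x", b"y"]
full = reference([a, b, c])
partial = reference([a, b, Ref(reference(c))])  # c replaced by its reference
```

Because `Ref` is a distinct type from the primitives, a reader of the partial structure can still tell which parts are present and which are only referenced.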

> BTW, there's no reference to how one determines whether a node is a leaf or a reference tree -- is that encoded in the hash in a way similar to BAO?

All data types are mapped to a pair of the tag (format) and the corresponding node. The tag will tell you whether the right node is a leaf or a subtree.

It's a bit tricky to compare to blake3 (which is what I think you meant), because blake3 just chunks a byte array into leaves and builds up a tree. Here we map data into a list of nodes; primitives end up as nodes with depth 1 and recursive types can have arbitrary depth. However, each node of the list gets merkle-folded into what is effectively a leaf node, from which we build the BAO tree.

It is a bit confusing because you can look at each node of the list as a subtree or a leaf. That said, you can tell whether what you have is a partial DAG or a complete one because you can differentiate primitives from references.

> Also, do we imagine a reader being able to request data in arbitrary formats and arbitrary levels of depth of de-referencing? Asking cause that's a non trivial amount of work to ask the person holding data to do. You'd essentially have to define a GraphQL like protocol to describe requests from the client.

The short answer is NO, but the longer answer is that it really depends on the case. I think of this as a way to enable providers to choose the level of indexing they want to support. It is really up to them and their customers to decide what kind of granularity they want to support. That said, a provider could always reduce granularity on demand, but they can not support more granular requests than they have chosen to index by.

From the W3Up perspective, we find that as a provider we are not in a good position to decide what to index and what to announce on the network. It really is domain specific, and we are thinking of ways to enable clients to choose what to index and to support the corresponding reads. What is proposed here is similar in spirit: I imagine a user uploading 4GiB of CBOR data and then publishing a verifiable index for parts within it.

> I will say while this design appears to solve many problems, it's also a sufficient mind bender and compatibility break that it's a bit frightening to think about. In the UCAN example above, the signature would not match either the current signature you'd get from a UCAN with a full delegation chain OR the signature you get with a UCAN with references.

That is accurate. That said, UCANs are moving in the direction of removing proof chains from delegations anyway, so it may not be as bad.

A different way to view this proposal might be to treat it as a canonical addressing scheme; you could still have a mapping from the canonical scheme to the one being used. If you do so, you suddenly become able to transcode things on the fly when requests are in the canonical addressing scheme.

> One way to consider this moving forward is to look at it at an application level, rather than a universal data format level. The application a lot of people want for IPFS is "Store my files" which doesn't need all this. (Maybe directories are a candidate but honestly that could benefit from simplicity too). On the other hand, if I wanted to use IPLD as a distributed content addressed graph database to power say a social application, wow this seems amazing.

I agree. This is really taking what Iroh does for files and applying it to structured data. There is a thought, though, that maybe the two can converge to a degree. Specifically, I have considered mapping the primitive string and bytes types into BAO, in which case bytes representing a file would result in the exact same hash as blake3, assuming blake hashing is used.

> In that context, it might make sense to define this as a kind of encoding format of its own. Personally, I don't think IPLD is useful when not working with structured data in an application context, rather than a file context. (UnixFS, while technically using IPLD, is a super weird format that only supports one kind of serialization and one set schema for blocks -- I wish we simply thought of it as a separate thing)

Agreed. I think UnixFS suffers from the same issues outlined here, that is, partitioning and layout are decided on write (well, more like on ingestion) and constrained in size by incremental variability. I think blake3 / bao makes far better tradeoffs here.

Honestly, just a switch in mindset to UnixFS as a particular view over a file might get us really far. In that case you can start comparing the tradeoffs offered by that view vs blake3/bao.

> BTW, I like turning all bytes blocks into a Bao Tree cause it creates a nice symmetry in the format.

👍
