
IPIP-305: CIDv2 - Tagged Pointers #305

Closed
wants to merge 2 commits

Conversation

johnchandlerburnham

This adds a spec for a CIDv2 proposal, originally discussed here: multiformats/cid#49

Included are corresponding changes to the CID specification repository (https://github.com/yatima-inc/cid/blob/master/README.md) and a preliminary draft implementation on rust-cid (https://github.com/yatima-inc/rust-cid/tree/cid-v2). I am happy to send PRs to https://github.com/multiformats/cid and https://github.com/multiformats/rust-cid respectively, but have not yet in the interest of centralizing discussion.

@johnchandlerburnham johnchandlerburnham requested a review from a team as a code owner August 7, 2022 01:36
@johnchandlerburnham johnchandlerburnham changed the title new IPIP for CIDv2 tagged pointers IPIP: CidV2 - Tagged Pointers Aug 7, 2022
@johnchandlerburnham johnchandlerburnham changed the title IPIP: CidV2 - Tagged Pointers IPIP: CIDv2 - Tagged Pointers Aug 7, 2022
@rvagg
Member

rvagg commented Aug 8, 2022

@vmx I'd love to hear your take on this since it's coming from a Lurk perspective now and you're a bit more versed in that land than most of us.

The big outstanding question for me before moving on to finer things is the version signalling. My bottom-up implementer's brain is coming at this from what's in the bytes, and I don't think either the original PR or this one is getting specific enough (or I'm reading them wrong, very possible).

Are we using the CID version specifier the proper way:

  • <version=2 varint>
  • <multicodec1 code varint>
  • <hash fn1 code varint>
  • <digest1 len varint>
  • <digest1>
  • <multicodec2 code varint>
  • <hash fn2 code varint>
  • <digest2 len varint>
  • <digest2>

or wrapping (hiding) this in a CIDv1 but using the codec to signal cidv2:

  • <version=1 varint>
  • <cidv2 multicodec=0x02>
  • <identity mh code=0x00>
  • <cidv2 length varint>
  • <multicodec1 code varint>
  • <hash fn1 code varint>
  • <digest1 len varint>
  • <digest1>
  • <multicodec2 code varint>
  • <hash fn2 code varint>
  • <digest2 len varint>
  • <digest2>

I'm suspecting that the language here, and over in multiformats/cid#49, might be suggesting both, with the latter reserved as a way to do backward compatibility where you know you're going to need it? Which might become a CIDv0-like problem once we've moved to a state where every system supports CIDv2 but everyone still wants to pass around CIDv1-wrapped CIDv2s.

@johnchandlerburnham
Author

@rvagg Just to clarify, the intent of the proposal is to have a v2 version varint, so the former of the two possibilities in your comment:

<cidv2> ::= <multicodec-cidv2><multicodec-data-content-type><multihash-data><multicodec-metadata-content-type><multihash-metadata>

That's what my Rust implementation does
https://github.com/yatima-inc/rust-cid/blob/01759499074136fefcaf1b1610661b76a81aff28/src/cid.rs#L210

However, I also described a backwards-compatible way of embedding a Cidv2 inside a Cidv1, as an optional thing:

<cidv2-inside-cidv1> ::= <multicodec-cidv1><multicodec-cidv2><identity-multihash-of-cidv2-serialization>

We can remove this from my proposal if that would make things clearer. You can already use the identity multihash to embed any bytes inside a CIDv1, so nothing new is implied by this, and it will still be possible regardless of what the final CIDv2 spec is. It's an existing (though rarely used) feature of multihashes and CIDs.
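
For concreteness, here is a minimal byte-level sketch of the two forms described above. It is not taken from the draft rust-cid branch; the varint helper and the `cidv2_multicodec` value passed to `wrap_in_cidv1` are assumptions.

```rust
/// Minimal unsigned-varint (LEB128) encoder, enough for this sketch.
fn put_varint(mut n: u64, out: &mut Vec<u8>) {
    loop {
        let byte = (n & 0x7f) as u8;
        n >>= 7;
        if n == 0 {
            out.push(byte);
            return;
        }
        out.push(byte | 0x80);
    }
}

/// <version=2><data codec><data multihash><metadata codec><metadata multihash>
/// The multihash arguments are assumed to already be fully encoded
/// (<hash fn code><digest length><digest>).
fn encode_cidv2(data_codec: u64, data_mh: &[u8], meta_codec: u64, meta_mh: &[u8]) -> Vec<u8> {
    let mut out = Vec::new();
    put_varint(0x02, &mut out); // CID version 2
    put_varint(data_codec, &mut out);
    out.extend_from_slice(data_mh);
    put_varint(meta_codec, &mut out);
    out.extend_from_slice(meta_mh);
    out
}

/// Optional backwards-compatible wrapping: the CIDv2 bytes carried inside a
/// CIDv1 via the identity multihash,
/// <version=1><cidv2 multicodec><identity code=0x00><length><cidv2 bytes>.
fn wrap_in_cidv1(cidv2_multicodec: u64, cidv2_bytes: &[u8]) -> Vec<u8> {
    let mut out = Vec::new();
    put_varint(0x01, &mut out); // CID version 1
    put_varint(cidv2_multicodec, &mut out); // hypothetical "cidv2" codec entry
    put_varint(0x00, &mut out); // identity hash function
    put_varint(cidv2_bytes.len() as u64, &mut out);
    out.extend_from_slice(cidv2_bytes);
    out
}
```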

Contributor

@aschmahmann left a comment

Added some questions and alternatives to try to flesh out our options here and the consequences.

I think referring to this proposal as tagged pointers is interesting, both because it largely matches tagged pointers in that they are useful but not strictly required, and because to some extent CIDv1 is already a tagged pointer in that it is tagged with codec information.

If we extend the tagged pointer concept to allow further information then probably we should make sure there's enough flexibility here that we're less likely to continue to add more tagged pointers in a CIDv3.


Having arbitrary-length CID metadata allows the data to be fully self-describing and abstracts application-specific interpretation away into the metadata CID.

### Compatibility
Contributor

I don't think this indicates the scope of what in the ecosystem will be affected by this change. The current text makes it appear as though introducing this new version of CID will be fairly trivial, when that's not quite the case.

Some example ramifications:

  • Existing CID parsers would need to be updated to support CIDv2, or else would error
  • Many CIDv2s will be too big to represent in subdomains which would effectively break how some tooling (e.g. HTTP gateways) work with CIDs today. Yes, the same is true of large CIDv1s but this is more likely with CIDv2s since they contain two CIDv1s.
  • Tooling that only supports CIDv1 could break if any node being accessed within the graph contains a CIDv2. This could make for a problematic UX for tools that, say, only take a root CID and assume they can operate on the whole graph
  • Existing IPLD tooling may need to be upgraded to support the new type of links and expose needed information to users

Many of these are just the cost of doing upgrades in general, or the cost of adding metadata to links, but we should accumulate these and know what we're getting into here.

Author

Existing CID parsers would need to be updated to support CIDv2, or else would error

Yes, but they would error cleanly, since afaiu parsers already have to match on the version varint. But I don't think it's a particularly complicated change to add a case for version 2, as I did in multiformats/rust-cid#123
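
As a rough illustration of that point (this is not the actual rust-cid patch, just a self-contained sketch):

```rust
#[derive(Debug)]
enum Version {
    V1,
    V2,
}

#[derive(Debug)]
enum CidError {
    UnknownVersion(u64),
    Truncated,
}

/// Minimal unsigned-varint reader, enough for this sketch.
fn read_varint(bytes: &[u8]) -> Option<(u64, &[u8])> {
    let mut value: u64 = 0;
    for (i, b) in bytes.iter().take(10).enumerate() {
        value |= u64::from(b & 0x7f) << (7 * i);
        if b & 0x80 == 0 {
            return Some((value, &bytes[i + 1..]));
        }
    }
    None
}

/// Decoders already branch on the leading version varint, so an unknown
/// version fails cleanly; supporting CIDv2 is one extra match arm.
fn read_version(bytes: &[u8]) -> Result<Version, CidError> {
    let (version, _rest) = read_varint(bytes).ok_or(CidError::Truncated)?;
    match version {
        1 => Ok(Version::V1),
        2 => Ok(Version::V2), // the new case a parser would add
        other => Err(CidError::UnknownVersion(other)),
    }
}
```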

Many CIDv2s will be too big to represent in subdomains which would effectively break how some tooling (e.g. HTTP gateways) work with CIDs today. Yes, the same is true of large CIDv1s but this is more likely with CIDv2s since they contain two CIDv1s

The most common CIDv2 sizes will probably be pairs of 256-bit or 512-bit hashes, which are roughly the same sizes as a 512-bit or 1024-bit CIDv1, which should be nearly universally supported.

Contributor

The most common CIDv2 sizes will probably be pairs of 256-bit or 512-bit hashes, which are roughly the same sizes as a 512-bit or 1024-bit CIDv1, which should be nearly universally supported.

Unfortunately not. The base36 encoding of a SHA2-512 raw CID is too long to fit into a URL subdomain. e.g. https://cid.ipfs.tech/#kf1siqqaod24wzk1b0jwakpjxj8z9xaqxwh56nnc267oznfqrm8cc0w0f36g6ir7zb1tuso6ch7kg3at9o6bnr8lm34hty32o1l0ljycu is 105 characters which is greater than the 63 character DNS label limit.


- [CIDv2 with arbitrary-precision multicodec size](https://gist.github.com/johnchandlerburnham/d9b1b88d49b1e98af607754c0034f1c7#appendix-a-cidv2-and-arbitrary-precision-multicodec)
- CIDv2 with nested hashes
Contributor

Could you detail this a bit more? Is this just allowing the CIDs inside the CIDv2 to also be CIDv2s rather than restricting them to CIDv1s?

Author

In this proposal the contents of CIDv2s are not CIDv1s, but rather the broken apart multicodec-multihash pairs. This is specifically to mitigate the issues with nesting raised in the previous discussion multiformats/cid#49.

The other idea, an arbitrary-precision multicodec, is to figure out how to safely remove the 9-byte limit on multicodec varints (such as by adding a size field), and then to manage larger metadata tags by allocating ranges on the now-infinite multicodec table. However, that solution requires both technical changes to implementations and process changes to how multicodec is managed, whereas the current IPIP should largely only require the former.


The proposal is also designed to be purely opt-in and backwards compatible with existing implementations. That said, some work may be required to ensure that implementations that do not wish to support CIDv2 can either read a CIDv2 as if it were a CIDv1 (and discard the trailing metadata) or error on the CIDv2 entirely.

### Alternatives
Contributor

What would happen if instead of a CIDv2 you just used a CIDv1 that looked something like { Data: <data-cid>, Metadata: <metadata-cid> } or { Data: <data-cid>, Metadata: <metadata-cid>, Type: <whatever-type-info-you-want> }?

This could be encoded as a CIDv1 in DAG-CBOR, or using any other format you wanted (a rough sketch of the shape follows the lists below).

Some advantages:

  • It doesn't require bumping the CID version and as a result a lot of tooling can be left alone
  • Your type data can be more than a single block without requiring an extra level of indirection
  • You can specify what your data is without reserving a code in the table for every data type you could want.
    • Sure maybe "IPLD Schema" is a reasonable way of representing many types, but I could also see applications showing up with a list of 100 types they'd want codes for. Allocating codes like this isn't just a pain for table maintenance and taking up table space, but it also forces more of the data structure logic out of band which makes it harder for an application that doesn't know what to do with the unknown code number to figure out what to do.

Some disadvantages:

  • It takes up a couple more bytes
    • It's more than a few bytes if you want to be self-describing, but in theory an application could just have a tuple of CIDs, which is fairly minimal overhead. That makes the data not self-describing, but it isn't self-describing in the current proposal either
  • A given application or ecosystem needs to decide on how to encode their metadata/type information
    • This needs to happen in the current proposal anyhow, but in the current proposal developers don't have to think about how to disambiguate data from metadata, just how to actually encode their metadata
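
For reference, a rough sketch of that shape (purely illustrative; the field names and the stand-in `Cid` alias are not a proposed schema):

```rust
/// Stand-in for a real CID type (e.g. the `cid` crate's `Cid`).
type Cid = Vec<u8>;

/// An ordinary block, encoded in DAG-CBOR (or any other codec) and addressed
/// by a plain CIDv1; the "tag" lives in the block rather than in the CID.
struct TaggedLink {
    /// CIDv1 of the content being linked.
    data: Cid,
    /// CIDv1 of the metadata block (possibly identity-encoded).
    metadata: Cid,
    /// Optional extra type information, if the application wants it.
    type_info: Option<Vec<u8>>,
}
```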

Author

What would happen if instead of a CIDv2 you just used a CIDv1 that looked something like { Data: <data-cid>, Metadata: <metadata-cid> } or { Data: <data-cid>, Metadata: <metadata-cid>, Type: <whatever-type-info-you-want> }

You mean creating IPLD lists or objects and then hashing them? This works fine for some cases, but not for others, since it requires traversing the hash. In the write-up I did for @vmx I go into some detail about why, for Lurk, we need to have the metadata tags in the pointers themselves: https://gist.github.com/johnchandlerburnham/d9b1b88d49b1e98af607754c0034f1c7

Your type data can be more than a single block without requiring an extra level of indirection

For large metadata, I think having a hash of the metadata is unavoidable. The advantage of this CIDv2 proposal though is that since a CIDv2 is isomorphic to a pair of CIDv1s, you can store your metadata and data in the same content-addressed store with self-describing keys. We do this in Yatima where we have large data and metadata trees for program ASTs: https://github.com/yatima-inc/yatima-lang/blob/35f868ab05a4059690e6da9db2e5c4419537fcd0/Yatima/Datatypes/Cid.lean#L23

So this proposal supports both large metadata (like Yatima's full metadata CIDs) and small metadata (like Lurk's 16-bit tags)

Sure maybe "IPLD Schema" is a reasonable way of representing many types, but I could also see applications showing up with a list of 100 types they'd want codes for. Allocating codes like this isn't just a pain for table maintenance and taking up table space, but it also forces more of the data structure logic out of band which makes it harder for an application that doesn't know what to do with the unknown code number to figure out what to do.

I think what would make sense, if this proposal is adopted, is to allocate a single metadata multicodec for each application, whether that's IPLD Schema, Lurk, Yatima, etc., and then each application would have its own logic for what its metadata means. E.g.

| name | tag | code | description |
| --- | --- | --- | --- |
| dag-cbor | ipld | 0x71 | MerkleDAG cbor |
| ... | ... | ... | ... |
| ipld-schema | ipld | 0x3e7a_da7a_0001 | an IPLD Schema DML in dag-cbor |
| lurk-metadata | lurk | 0x3e7a_da7a_0002 | A Lurk tag in the identity multihash |
| yatima-metadata | yatima | 0x3e7a_da7a_0003 | A hash of a Yatima metadata AST |

This has a similar effect as allocating ranges in the multicodec table, but without the centralized overhead.

Contributor

You mean creating IPLD lists or objects and then hashing them? This works fine for some cases, but not for others, since it requires traversing the hash.

I read through the writeup, but still don't understand. What's the problem that you run into if instead of something like

<0x02><lurk-data-code><lurk-data-multihash><lurk-tag-code><lurk-tag-identity-multihash>

you had

taggedLink = EncodeDagCbor([<0x01><lurk-data-code><lurk-data-multihash>, <0x01><lurk-tag-code><lurk-tag-identity-multihash>])

<0x01><0x71><identity-multihash-of-taggedLink>

It seems like the bytes would be almost the same, and any code working with Lurk data would already know how to do the conversion of the CIDv1 into two different objects, and the use of identity multihashes saves you from doing any repeated hashing.

Author

The key difference is that the Lurk tags are not legible from taggedLink without traversing the pointer. In the Lurk case, this might be impossible if we're pointing towards a private input.

@porcuquine, @vmx and I had a long discussion on the Lurk discord about why this is necessary: https://discord.com/channels/908460868176596992/913200327547822110/964156408490754058

Member

What would happen if instead of a CIDv2 you just used a CIDv1 that looked something like ...

My worry about just pushing as much as we can into CIDv1 is that we end up losing the utility of the CID because it just becomes a way to squish arbitrary data into a point in a block. One of the main purposes of a CID in IPLD is to provide clear linking semantics between blocks. If we overloaded CIDv1 and hid the actual content address of the link in an inline portion of it, then even though the blocks might load fine in existing systems, the DAG disappears because the links aren't links anymore. We end up at the same place as a CIDv2, having to update all our systems to interpret this new thing, and while it may be less painful and give us more time to adjust, it also gives us lots of space to not upgrade at all, or to give edges of our ecosystem space to not upgrade, turning DAGs into collections of arbitrary blocks.

The choice would be something like: would you rather push your DAG to a pinning service where you don't know if they support the new inline CIDv1-with-embedded-link, and therefore, just in case, you have to push them each block one by one and get them to pin each block individually? Or have the pinning service error with "unknown CID version: 2" and move on to a different pinning service, knowing that you just want to pin a root and they'll take care of the DAG connectivity?

I think I'm on team just accept the pain and upgrade all the things even though it's going to take time. I also think I'd prefer to not have a CIDv1 variant in the spec because having an easy way out might leave us in a halfway state that sucks more than just biting the bullet.


I also think I'd prefer to not have a CIDv1 variant in the spec because having an easy way out might leave us in a halfway state that sucks more than just biting the bullet.

I think that makes a lot of sense. While my initial thinking was that CIDv2 would be an optional extension that would live alongside CIDv1, I think that there's certainly a way to modify CIDv2 to have it work as a CIDv1 replacement.

Specifically I think what I would want to do is

pub struct Cid<const S: usize, const M: usize> {
    /// The version of the CID.
    version: Version,
    /// The codec of the CID.
    codec: u64,
    /// The multihash of the data.
    hash: Multihash<S>,
    /// The metadata multicodec, if any.
    meta_codec: Option<u64>,
    /// The metadata multihash, if any.
    meta_hash: Option<Multihash<M>>,
}

And then we would need a bit to switch on whether the cid has metadata or not:

<cidv2> ::= <multicodec-cidv2><multicodec-data-content-type><multihash-data>(<multicodec-metadata-content-type><multihash-metadata>)

or

<cidv2> ::= <multicodec-cidv2><multicodec-data-content-type><multihash-data><has-metadata-varint>(<multicodec-metadata-content-type><multihash-metadata>)

where everything in the parentheses is present if has-metadata = 1 and absent if has-metadata = 0.

If we don't want to add a whole extra varint for a single bit, though, we could actually switch on the version varint, where Version::V1 has no trailing metadata and Version::V2 has mandatory trailing metadata. That's maybe more in the same vein as "CIDv2 as an optional extension for CIDv1", though.
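
A decoding sketch of the second form above (the explicit has-metadata varint); this is illustrative only, and the helpers are written out just to keep it self-contained:

```rust
struct CidV2 {
    codec: u64,
    hash: Vec<u8>,                // encoded multihash of the data
    meta: Option<(u64, Vec<u8>)>, // (metadata codec, encoded metadata multihash)
}

fn read_varint(b: &[u8]) -> Option<(u64, &[u8])> {
    let mut v = 0u64;
    for (i, byte) in b.iter().take(10).enumerate() {
        v |= u64::from(byte & 0x7f) << (7 * i);
        if byte & 0x80 == 0 {
            return Some((v, &b[i + 1..]));
        }
    }
    None
}

/// Reads <hash fn code><digest length><digest> and returns the whole encoded multihash.
fn read_multihash(b: &[u8]) -> Option<(Vec<u8>, &[u8])> {
    let (_code, rest) = read_varint(b)?;
    let (len, rest) = read_varint(rest)?;
    let len = len as usize;
    if rest.len() < len {
        return None;
    }
    let consumed = (b.len() - rest.len()) + len;
    Some((b[..consumed].to_vec(), &b[consumed..]))
}

fn decode_cidv2(bytes: &[u8]) -> Option<CidV2> {
    let (version, rest) = read_varint(bytes)?;
    if version != 2 {
        return None; // other versions handled elsewhere
    }
    let (codec, rest) = read_varint(rest)?;
    let (hash, rest) = read_multihash(rest)?;
    let (has_meta, rest) = read_varint(rest)?;
    let meta = if has_meta == 1 {
        let (meta_codec, rest) = read_varint(rest)?;
        let (meta_hash, _rest) = read_multihash(rest)?;
        Some((meta_codec, meta_hash))
    } else {
        None
    };
    Some(CidV2 { codec, hash, meta })
}
```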

Member

+1 on the struct but you're right about the optionality - I don't know if I have an opinion yet on having an additional bit vs making metadata mandatory for v2 and therefore requiring a v1 where there is no metadata. A third way would be to make it mandatory if you're using a v2 but allow for the metadata to be 3 zero-bytes [0,0,0] (codec=0, hasher=identity/0, digest length=0), which would be equivalent to the v1 form - 3 wasted bytes instead of a single one for a flag, but you still get to choose whether you use a v1 to save those bytes.

One thing that continues to bother me about this (I mentioned this in the other thread) is that I lose the ability to inspect initial bytes to see what's coming. Currently we can do this with just enough bytes to read 3 varints: https://github.com/multiformats/js-multiformats/blob/dcfdac59df3570b85e633afae5ac8f6caf0a4441/src/cid.js#L312-L324

Arguably the utility of this isn't as great as it seems, but I'd probably have to remove that function, or make it throw, or something else in the case of a CIDv2. Its main use is in decodeFirst() (function defined just above) which is basically the same as: https://github.com/ipfs/go-cid/blob/802b45594e1aed5be3a5b99f00991e9fa8198bfa/cid.go#L691 - the use-case being - "here's a source of bytes I know starts with a CID, give me the CID and the remaining bytes". If there were a way to make it easier to do this initial-bytes-inspection then that'd be great, but it's not a blocker. e.g. if we must have a flag for these optional pieces, we could turn it into a "full length" varint and put it near the front; for common cases I think we'd still fit that in a single byte so it wouldn't be a massive waste. 🤷
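
Just to make the "full length varint near the front" idea concrete, a decodeFirst-style splitter would then only need two varints. A sketch under that assumption (this is not any agreed layout):

```rust
fn read_varint(b: &[u8]) -> Option<(u64, &[u8])> {
    let mut v = 0u64;
    for (i, byte) in b.iter().take(10).enumerate() {
        v |= u64::from(byte & 0x7f) << (7 * i);
        if byte & 0x80 == 0 {
            return Some((v, &b[i + 1..]));
        }
    }
    None
}

/// Splits a CIDv2 off the front of a byte stream, assuming a hypothetical
/// layout of <version=2><payload length><payload>; returns (cid bytes, rest).
fn decode_first(bytes: &[u8]) -> Option<(&[u8], &[u8])> {
    let (version, rest) = read_varint(bytes)?;
    if version != 2 {
        return None; // CIDv0/CIDv1 would be handled by the existing paths
    }
    let (payload_len, rest) = read_varint(rest)?;
    let header_len = bytes.len() - rest.len();
    let total = header_len + payload_len as usize;
    if bytes.len() < total {
        return None;
    }
    Some((&bytes[..total], &bytes[total..]))
}
```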

The proposal is also designed to be purely opt-in and backwards compatible with existing implementations. That said, some work may be required to ensure that implementations that do not wish to support CIDv2 can either read a CIDv2 as if it were a CIDv1 (and discard the trailing metadata) or error on the CIDv2 entirely.

### Alternatives

Contributor

Another alternative is if instead of redefining CID we redefined what Link means in the IPLD Data Model.

From what I can tell CIDs are used in primarily two places:

  1. As the descriptions of objects that users and applications pass around (e.g. ipfs://<cid>)
  2. As the internal links inside of DAGs

Given that the object descriptions always have their own custom meaning anyway (e.g. ipfs:// currently is approximately equal to "try seeing if the data is UnixFS", ipfs block get <cid> assumes the data is an independent block, v1 of the remote pinning API assumes the CID to pin is the root of a graph, ...) adding metadata here is not particularly interesting.

Adding metadata inside of the DAG is interesting; however, changing the CID spec isn't necessary for this. You could also change what links mean in the IPLD Data Model and get the same result. Historically it appears that this was intentional; for example, in https://github.com/ipld/ipld/blob/835d010583accf0dbec7f3ddbd4b6a66f86e2fa2/_legacy/specs/FOUNDATIONS.md#linked it's indicated that Links were intended to eventually allow for referring to data inside of blocks. Similar logic could extend to allowing for other kinds of type information there as well.

  • Advantages:
    • No need to bump the CID version and so a lot of existing tooling can be left in place
    • Type data can be more than a single block without requiring an extra level of indirection
    • It's not necessary to define codes in the table for your types if you don't want to
  • Disadvantages:
    • Shares disadvantages with the current proposal regarding the need to rework tooling.

Author

IPLD already feels a little like a second-class citizen in a lot of IPFS implementations, and I worry that breaking the identity between CID and IPLD::Link would just exacerbate that

Contributor

How can that both be true and this CIDv2 proposal still be relevant? If you take the position that non-IPLD things are second class, then what you're left with is basically UnixFS, and then what are these tags going to do for UnixFS data? In order for the tags to be useful, the IPLD tooling is going to need to expose them anyhow.

Author

If you take the position that non-IPLD things are second class

That's not my position. I was observing that in e.g. the IPFS http api we have two parallel sets of calls for ipfs block and ipfs dag: https://docs.ipfs.tech/reference/kubo/cli/#ipfs-dag, with the latter being generally less well supported.

Changing the IPLD data model to make an IPLD::Link not a CID would probably result in a lot of implementations just not supporting IPLD

Contributor

Changing the IPLD data model to make an IPLD::Link not a CID would probably result in a lot of implementations just not supporting IPLD

How are these implementations benefiting from the tag information inside the CID if the IPLD tooling doesn't support exposing or working with that tag information? In your example how would you expect either of kubo's block or dag commands to change to benefit from CIDv2 without having IPLD tooling support?

Author

I wouldn't really expect the kubo commands to change much; the additional information in a CIDv2 is primarily intended to be used at the application level.

How are these implementations benefiting from the tag information inside the CID if the IPLD tooling doesn't support exposing or working with that tag information?

Specific IPLD libraries like rust-cid will support extracting/manipulating the tag information, and that should be enough for the specific use-cases of CIDv2 I'm aware of

Comment on lines +67 to +68
meta_codec : 0x3e7ada7a,
meta_hash: <schema-multihash>
Contributor

This looks great, but what happens if my schema starts becoming large? For example, say I have a 3MiB schema. Now this schema exceeds the 2MiB block limit imposed by many IPFS implementations, and that 3MiB schema won't be transferable. Maybe a 3MiB schema seems excessive, but people may go down this road for other reasons (e.g. I want my tag to be wasm-module and my WASM code happens to be large).

I could start playing around with a few levels of workarounds here such as:

  1. Get a new code for unixfs-representation-of-schema
    • Sad because now I need to change my code to process schema and unixfs-representation-of-schema as schemas
    • Sad because I need a new code for every different system I use to encode my bytes (UnixFS, FBL, BitTorrent v1/v2, WNFS, etc.)
  2. Make a new type ipld-ADL-wrapper-dag-cbor that looks like { TypeData: <type-cid>, TypeADL: "unixfs" } encoded as dag-cbor, and add code so that when I encounter an ipld-ADL-wrapper-dag-cbor I recurse in a layer
    • Sad because it seems like we end up having to make our own type system anyhow despite the CIDv2 version bump

This seems to indicate that putting type information in the CID this way is going to be problematic because types may themselves have types and so we may want to deal with them the same way we deal with the data itself (e.g. allowing the metadata to be a CIDv2 as well, or one of the other proposals).

Author

This looks great, but what happens if my schema starts becoming large?

It doesn't sound like this is a CIDv2 specific issue. I can store a 3MiB dag-cbor IPLD object on IPFS and generate a CIDv1 with its sha256 multihash, right? In terms of transport, I think CIDv2 will just behave like a pair of CIDv1s

Contributor

It doesn't sound like this is a CIDv2 specific issue.

No, it is a specific issue with this CIDv2 proposal. Nowhere else do we use codec identifiers as data types; we use them as deserialization types. As a result there is no notion of a data type changing or becoming too big; that becomes an application-layer concern. For example, an object can represent a UnixFS directory whether it is a single directory block or the root of a sharded HAMT.

By using the code as a nominative type rather than a description of how to deserialize the data, you've navigated into a position where there's nowhere to identify both the type of the data and how to get it as a multiblock data structure without another level of indirection. However, that level of indirection could similarly be used instead of CIDv2 entirely (see alternative proposal).


Nowhere else do we use codec identifiers as data types, we use them as deserialization types.

Using a codec identifier as a datatype isn't an essential (or actually even intended) part of this proposal, so I'm absolutely happy to make any changes you suggest to better align with how multicodecs should be used.

For example, replacing the 0x3e7ada7a example codec for ipld-schema with dag-cbor, and in general specifying that the metadata codec should refer to the metadata format would be totally fine for the intended use-cases

CidV2 {
data_codec: 0x71,
data_hash: <data_multihash>,
meta_codec : 0x3e7ada7a,
Contributor

Having a single code for "IPLD schema" followed by a multihash seems off. IPLD schemas can be represented in multiple different formats including dag-json and dag-cbor. Is this a codec for ipld-schema-dag-json?

It seems quite bizarre that we'd need to define multiple codes for ipld-schema-<some ipld codec> for any codec we might want to use to encode a schema. Basically what's happened here is we've glued back together the structure of the data and the serialized form of the data when describing type information. While sometimes users might be fine with that I suspect other times they may not, just as is the case with regular data.

Author

Having a single code for "IPLD schema" followed by a multihash seems off. IPLD schemas can be represented in multiple different formats including dag-json and dag-cbor. Is this a codec for ipld-schema-dag-json

There are a lot of things in multicodec for which this is also true (e.g. the ethereum codecs: https://github.com/multiformats/multicodec/blob/master/table.csv#L55) and my understanding of how it works there is that the format is just described in the description (such as https://ethereum.org/en/developers/docs/data-structures-and-encoding/rlp/, even though you could in principle encode any RLP data as dag-cbor if you wanted)

Contributor

There are a lot of things in multicodec for which this is also true ... the ethereum codecs ... even though you could in principle encode any RLP data as dag-cbor if you wanted

IPLD codecs tell you how to decode serialized representations of data (into the IPLD data model), not necessarily what the data is or what it's for. The Ethereum codecs, like the Git ones, are tied to a particular serialized data format; if you wanted to transcode the data into something like dag-cbor, tagging the data with the prior codec would result in a deserialization error.

Many existing hash-linked data structures have more fixed representations than, say, the FBL ADL, which is defined over arbitrary serialized forms as long as they can be decoded into a compatible IPLD Data Model layout. As a result it can appear as though the codecs are types even though they're deserialization mechanisms.

This means that what you'd really need to express the type correctly is a second code to say "this is dag-json" next to the code saying it's an ipld-schema.

Author

@johnchandlerburnham Aug 8, 2022

IPLD codecs tell you how to decode serialized representations of data (into the IPLD data model), not necessarily what the data is or what it's for.

The key word there is necessarily. A multicodec can absolutely tell you what the data is for. For example, if you have a CIDv1 that points to an Ethereum block, you could equally choose to encode it using

| name | tag | code | description |
| --- | --- | --- | --- |
| rlp | serialization | 0x60 | recursive length prefix |
| eth-block | ipld | 0x90 | Ethereum Header (RLP) |

Likewise we have a codec for all cbor and a more specific codec for dag-cbor. And ofc raw supersets everything.

So there's nothing strange if the IPLD Schema team wanted to set a default format

| name | tag | code | description |
| --- | --- | --- | --- |
| ipld-schema | ipld | 0xdead_beef | Ipld Schema (dag-cbor) |

or with dag-json. Yes we could have ipld-schema-dag-json, ipld-schema-dag-cbor to disambiguate, but that seems like it should be an application level decision whether or not they'd want to ask for multiple multicodecs to do that

Contributor

The key word is necessarily there. A multicodec can absolutely tell you what the data is for.

If doing nominative typing this way were reasonable, then CIDv2 wouldn't be necessary in almost any application, since you could just register every type as a different codec. IIUC this kind of thing would in theory work for the UCAN case as well; it's just that using the global code table for nominative typing like this seems bad. Applications can end up with many different named data types, sometimes tens or hundreds, or the many more that Lurk would require if codes were reserved in the table this way.

Some links around not using multicodecs for nominative types:

Yes we could have ipld-schema-dag-json, ipld-schema-dag-cbor to disambiguate, but that seems like it should be an application level decision

Sure, but how can I do a non-disambiguated ipld-schema that just works like IPLD Schemas do on any IPLD Data Model data? This code field has provided nominative typing, but without enough parameters to be useful for parameterized nominative types like IPLD Schemas (or for dealing with multiblock data structures as in #305 (comment)).

Member

@rvagg Aug 11, 2022

I'm kind of glad this discussion is happening here, although I feel it might be a diversion from the main discussion—which is why it's probably good that we get this on the table now. This specific point is why I was hoping to have @vmx chime in. I worry that the Lurk specifics embedded in the doc here might be a distraction from the main goal. Even after reading all of this I don't really understand why, with the second CID-ish for metadata, Lurk just couldn't encode dag-cbor, dag-json, or even raw custom-format bytes with the tag they want. Specifically: meta_codec could be dag-cbor (0x71), and meta_hash be inline (0x0) with whatever you like for your tag—you could even embed the mega-int here that the 9-byte varints are getting in the way of currently.
Perhaps that's essentially what you're aiming for through the use of a new "codec" to identify a "schema", just keeping it more efficient.

But my point again is that I think this is a distraction for the purpose of this spec. If Lurk wants to abuse the multicodec spec then that's their choice. If they want to register a new codec for this purpose, to identify a "schema", it would be best for everyone to do that in the multicodec table, where we could continue this discussion. For now, I think 0x3e7ada7a is in the way. It stands apart from the commonly understood purpose of a CID and, as this discussion is suggesting, there's a weirdness about it that leads us into a deep hole (the multicodec repo has many of these deep holes, covering very similar territory; I even had this discussion specifically about rlp and the eth codecs just a couple of months ago). We accept that there are squishy edges to the concept of a "multicodec", but always work to try and keep things toward the well understood and agreed-upon center where possible.

So my suggestion is to remove 0x3e7ada7a, shunt this distraction to the multicodec repo in due course, and go with something more commonly understood - maybe just an inline dag-json blob. Then we can at least start reasoning about the basics.

Member

But my point again is that I think this is a distraction for the purpose of this spec.

I agree. I see this just as an example of how people might want to use it. I think the purpose of the proposal should be about whether we want those CIDs with two pointers to provide additional context or not.

In regards to Lurk, I also don't think the 0x3e7ada7a codec is needed. It could just be e.g. DAG-CBOR and you could encode your schema as such.


Even after reading all of this I don't really understand why, with the second CID-ish for metadata Lurk just couldn't encode a dag-cbor, dag-json, or even raw custom format bytes with the tag they want.

Just to clarify, we absolutely can, and this is part of the intent of having the second CID. The 0x3e7ada7a was only meant as an illustrative example for ipld-schemas, but the rest of the proposal is the same if we just replaced it in the doc with 0x71 for dag-cbor, as @vmx suggested.

Regarding the prior topic about nominative types, I don't think either Lurk or Yatima need or want to add typing to multicodec, really. To be super concrete about what we need: For Lurk, we want to add a 16-bit metadata field to our CIDs, and for Yatima we want a 256-bit metadata multihash. In terms of multicodecs, I don't think it matters that much to either of those use cases whether we get a single application codec, multiple application codecs, or no codecs (and we just use e.g. dag-cbor). As long as we have a flexible way to add metadata, if we do end up needing additional info in pointers, we can just put it in that variable metadata field (e.g. with the identity multihash)
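
To make the small-metadata case concrete, this is roughly all Lurk would need in the metadata position. A sketch only: the big-endian byte order and the helper are assumptions, and the choice of metadata codec is left open as discussed above.

```rust
/// Minimal unsigned-varint (LEB128) encoder.
fn put_varint(mut n: u64, out: &mut Vec<u8>) {
    loop {
        let byte = (n & 0x7f) as u8;
        n >>= 7;
        if n == 0 {
            out.push(byte);
            return;
        }
        out.push(byte | 0x80);
    }
}

/// Encodes a 16-bit Lurk tag as an identity multihash:
/// <identity code=0x00><length=2><tag bytes>.
fn lurk_tag_multihash(tag: u16) -> Vec<u8> {
    let digest = tag.to_be_bytes();
    let mut out = Vec::new();
    put_varint(0x00, &mut out); // identity hash function
    put_varint(digest.len() as u64, &mut out);
    out.extend_from_slice(&digest);
    out
}
```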

@Ericson2314

Just as Rust doesn't force every pointer to be to a trait object (i.e. double wide, with a vtable), I don't think this is good for CIDv2.

We could have a "fat CID" spec in addition, but I don't think it should replace today's CIDs. It is simply a different use-case.


It is also possible to go in the other direction (e.g. a WASM decoder, as @aschmahmann mentions) and, for different purposes, use a pair of a raw multibase and a decoding function, but that only makes sense once we have a spec for what the decoder looks like.

In that latter scheme I would want to be careful that we aren't merely choosing between enumerated formats but are handling arbitrary data. E.g. the definition of equality for the "structured block" (referred to by such a CID) would be something like

(f, b) == (g, c) = f b == g c

i.e. even if two decoders are different, if the result of running them on the raw block (referred to by multibase) is the same, then the referred-to structured blocks are also the same.

f == g could be used as an optimization (f == g && b == c => (f, b) == (g, c)) but f != g doesn't mean much of anything as semantic function equality (as opposed to syntactic) is not decidable.

This is neat stuff that makes IPFS more inherently pluggable than enumeration with a fixed set of policies (multicodec), but I don't get the sense that that is what this proposal is going for.

@RangerMauve

Just wanted to chime in that we're still pretty excited about this happening from the IPLD team's side. Particularly this could be useful for signaling ADLs and schemas and would be a good fit with IPLD URLs for signaling extra parameters beside CIDs.

So far I think we're most excited about extensions that don't require extending any multiformats tables.

@johnchandlerburnham
Author

@RangerMauve Awesome, I'm happy to make changes to my tentative draft here based on what you guys think will be most useful. It seems like there are a lot of different ways to slice the cake here, and I imagine that you guys have the best perspective on what the constraints are across the ecosystem. From the Lurk and Yatima side we mainly just want a flexible/extensible way to add metadata to CIDs that IPFS implementations still know how to resolve, and the proposal by @mikeal for CIDv2 as a pair of CIDv1s seemed essentially in the same vein.

Let me know what you guys on the IPLD team think next steps should be!

@RangerMauve

@johnchandlerburnham I'm personally into the two CIDs option myself. It'd be useful if Mikeal's team could comment on whether this works for them so we can progress further. 😁

For some of the IPLD URL use cases I was imagining it'd be useful to have the option of having arbitrary key-value pairs, but I think that's covered by having the second CID inline its data.

@rvagg
Member

rvagg commented Sep 27, 2022

My current thoughts on this: what we seem to want boils down to essentially a combination of a link and a place to store some arbitrary properties. Those properties are likely going to be some form of namespaced value set of one or more things. Which is basically a set of key/value pairs (where perhaps the keys can come in the form of a multicodec table code, but that's still just a key, albeit one with decent uniqueness properties).

By using a second CID I suspect we're mostly going to be using it as a place to identity-encode arbitrary data, but I'd bet that'd end up looking mostly like a key/value set of one or more things.

So, is it possible to make a case to skip the complexity and extra bytes in needing to have it all encoded into a CID and just jump straight to encoding a key/pair list? e.g. a potential scheme could be:

<0x02><cidv1><list of key/pairs>

Where:

  • 0x02 is the multicodec code for cidv2
  • cidv1 is just a plain CIDv1
  • list of key/pairs is an encoded array of key=value pairs (actual encoding form tbd but if we keep key and value types stable [maybe even just bytes, or keys could strictly be varints] then most encoding schemes would boil down to one length-prefixed array and length-prefixed keys & values); a byte-level sketch follows below.
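
A sketch of one possible serialization of that list (nothing here is settled; the entry-count prefix and length-prefixed values are assumptions, just to show how small the machinery could be):

```rust
/// Minimal unsigned-varint (LEB128) encoder.
fn put_varint(mut n: u64, out: &mut Vec<u8>) {
    loop {
        let byte = (n & 0x7f) as u8;
        n >>= 7;
        if n == 0 {
            out.push(byte);
            return;
        }
        out.push(byte | 0x80);
    }
}

/// <0x02><cidv1 bytes><entry count><key varint><value length><value bytes>...
/// Keys are multicodec-style varints; values are opaque bytes.
fn encode_cidv2_kv(cidv1: &[u8], pairs: &[(u64, Vec<u8>)]) -> Vec<u8> {
    let mut out = Vec::new();
    put_varint(0x02, &mut out); // hypothetical multicodec code for "cidv2"
    out.extend_from_slice(cidv1); // a plain, already-encoded CIDv1
    put_varint(pairs.len() as u64, &mut out);
    for (key, value) in pairs {
        put_varint(*key, &mut out);
        put_varint(value.len() as u64, &mut out);
        out.extend_from_slice(value);
    }
    out
}
```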

By skipping the second CID as an identity encoded block of data, and explicitly saying that we have a key/value set, we would be more likely to have an emergent set of common keys that systems could look for and describe. Identity CIDs can be any codec and the encoded data could take any form so mostly you'd have a link and an untyped object.

One argument the other way is at least the identity CID can [potentially] tell you how to decode all parts of the bytes, a key/value set leaves you with values that aren't decoded unless you know what to do with the key.

Thoughts?

@rvagg
Member

rvagg commented Sep 27, 2022

Separate topic for discussion: limits.

We have to launch this thing with some kind of bounds. Not having this for identity CIDs continues to be a pain for so many parts of our stack (ref). My proposal is to come up with a basic byte limit, code it as a constant in our core CID handling libraries to be used wherever CIDs are decoded, and error if the limit is exceeded. But document the limit as changeable—i.e. if you can make a good case for increasing it for your use-case then you need to start a discussion. We could come up with some squishy language in the spec about this too, and likely we'd want affordances for users to run CID libraries with their own custom limits.

We haven't solidified a format yet and we also haven't heard enough about enough use-cases to make informed decisions about such things, but just to get noggin's joggin' my starting bid would be in the order of 2048 bytes, a number I've started imposing in some places for identity CIDs.
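
A sketch of the "constant with an escape hatch" shape (the names and the 2048 figure just mirror the comment above; none of this is settled):

```rust
/// Default cap, enforced wherever CIDs are decoded.
pub const DEFAULT_MAX_CID_LEN: usize = 2048;

#[derive(Debug)]
pub enum LimitError {
    TooLong { len: usize, limit: usize },
}

/// Callers with unusual needs can pass their own limit; everyone else gets the default.
pub fn check_cid_len(bytes: &[u8], custom_limit: Option<usize>) -> Result<(), LimitError> {
    let limit = custom_limit.unwrap_or(DEFAULT_MAX_CID_LEN);
    if bytes.len() > limit {
        return Err(LimitError::TooLong { len: bytes.len(), limit });
    }
    Ok(())
}
```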

@Winterhuman

In the case of the key:value idea, multiformats/multicodec#4 likely has relevance to this, since you could potentially encode MIME types into CIDs.

@Winterhuman

Winterhuman commented Sep 27, 2022

@rvagg One question about this layout:

<0x02><cidv1><list of key/pairs>

How would this work when expanding out the <cidv1> components? Multibase needs to be the first bytes/characters (I think), so would it be:

<mb><0x02><0x01><dag-multicodec><multihash><multihash-size><multihash-digest><list of key/pairs>

In which case, perhaps the <0x02> and <0x01> parts could be simplified to just <0x02> with CIDv1 implied, and with CIDv0 being <mb><0x02><multihash>...<list of key/pairs>?

@Winterhuman

Winterhuman commented Sep 27, 2022

Alternatively, since making this "CIDv2" is complicated by the fact that it can contain CIDv0 and CIDv1 within itself, what about this?

<multibase><metadata-multicodec><multihash><multihash-size><multihash-digest><0x01><dag-multicodec><list of key/pairs>

The idea is to register a "metadata" multicodec, which will cause legacy parsers to fail there, and then move the standard multicodecs to after the multihash, followed by the keypairs (in the case of CIDv0, it would just omit the standard multicodecs and go straight to the keypairs). I'll openly admit this may be a terrible, terrible idea, but I thought I'd put it out there.

@Winterhuman

Something I'm noticing from all the CIDv2 proposals so far is: what does CIDv2 actually need to do?

  • Does CIDv2 need to resemble CIDv1? If so, in what ways?
  • Does CIDv2 need to support including CIDv0? If so, why?
  • What type of, and how much, metadata should CIDv2 support?

It seems that pretty much everything proposed answers some or all of these requirements, but clearly there are other requirements, not being formally declared, which make some solutions better than others.

@johnchandlerburnham
Author

johnchandlerburnham commented Sep 27, 2022

@Winterhuman

what does CIDv2 actually need to do?

For the Yatima use-case, we have an intermediate representation for the Lean Theorem Prover (https://github.com/leanprover/lean4) where we content-address computationally relevant information separately from non-computational metadata (similar in spirit to Unison-Lang https://www.unison-lang.org/learn/the-big-idea/). A Yatima identifier for a declaration thus has two hashes, one for the anonymized program and one for the metadata (like variable names), such that whenever two identifiers share the same anon-id, they represent the same program. This CIDv2 proposal will allow us to make Yatima identifiers isomorphic to CIDs directly, without a layer of dereferencing or abuse of the identity multihash. But we don't strictly need pairs of CIDs to make that work; pairs of multihashes with a single multicodec would be fine there.

For the Lurk use-case, we have at least 16 bits of metadata that need to be included with the Poseidon digest that hashes some Lurk expression. This requirement is implied by certain subtleties of how Lurk tries to minimize the number of constraints in its zero-knowledge circuit backend. For the Lurk use-case, we don't really need pairs of CIDs either; another option would be to allow users to reserve ranges in the multicodec table (and probably remove the current arbitrary 64-bit size limit on multicodec sizes).

I don't have a deep understanding of the DAG House use-case, but from what I infer from https://github.com/multiformats/cid/pull/49/files, multihash pairs might work for them.

The meta-level observation unifying these use-cases is that multicodec as it's currently structured (a single u64 you have to make a PR to a GitHub repo to reserve) is too limiting for lots of interesting things various people want to do.

@rvagg

So, is it possible to make a case to skip the complexity and extra bytes in needing to have it all encoded into a CID and just jump straight to encoding a key/pair list?

My gut reaction to this is that it's adding a lot of epicycles to what a CID is. If CIDs can carry whole maps around, they could get big, so then we have to limit their size, which then might cause users to do strange compression things to try to get their desired metadata into a size-limited map. It seems kinda messy.

That said, if key-value pairs can embed multihash values, then it should be workable for what we want to do, since it then supersets the expressiveness of my original CIDv2-as-a-4-tuple proposal (which is also an epicycle, tbf).

But it feels like there might be some more general/more elegant thing that solves this problem, either in the vein of changing how multicodec works, by changing the IPLD data model to include metadata links (as @aschmahmann suggested earlier), or maybe with some new idea from the WASM encoder effort.

@johnchandlerburnham
Author

Also, one thing I've noticed is that the way CIDv2s would nest is really weird. Suppose I have some data d_1 and some associated metadata m_1, and then I make a CIDv2 (however it's implemented) out of the hashes of d_1 and m_1:

c_1 = Cidv2.mk(d_1.hash(), m_1.hash())

But then if I include c_1 in some new data d_2 (with its own metadata m_2) and construct a new CIDv2:

c_2 = Cidv2.mk(d_2.hash(), m_2.hash())

c_2 now implicitly content-addresses m_1, and if m_1 were to change, c_2 would change. So m_1 is not really purely metadata anymore, but morally it's like data in d_2 (assuming everything is always available).

So actually in order to get something that behaves like a metadata CID, it seems like we need two parallel trees, one for data, one for metadata. So for example, if you have some IPLD data

[1, 2, true, "buckle my shoe", <data_cid_1>]

you could have a separate expression

[null, null, null, null, <meta_data_cid_1>]

which has the same tree shape, but holds the metadata links which correspond to the data links. This way, at every point in constructing nested CIDv2s, the metadata would never "bleed" into the data hashes.

@rvagg
Member

rvagg commented Sep 28, 2022

If CIDs can carry whole maps around, they could get big

This comment, and your whole follow-up comment, actually get to one of my biggest concerns about what we're doing. Right now it is possible for a CID to carry a whole map around. Identity CIDs open a Pandora's box that causes us all sorts of headaches, and the initial proposal for CIDv2=CID_1:CID_2 is most commonly going to use identity CIDs in this second CID position to store entirely arbitrary data - and not just maps, but whole, complex, untyped data structures whose schemas will be specific to whatever system they come out of. We'll be in a situation of wanting a registry of codecs + schemas just to figure out what the second (metadata) CID even means in isolation from its origin system.

So my suggestion was an attempt to open that up a little bit and explore the implications of just acknowledging the fact that people are going to want a bucket to store miscellaneous pieces of metadata and try and embed a basic schema of that metadata into the spec for a CIDv2. Because, if we say it's a list of "key=value" pairs in the same way as a URL querystring is, then at least there's the possibility of looking over them for some common keys that maybe you can do something with.

If we used multicodec table entries for keys, then we could have things like 0xf0f0f0 being "IPLD Schema" and know that the value should be a CID pointing to the schema to load to decode the data (let's set aside the CID-in-CID going on here for now), and 0xf1f1f1 being "WASM ADL", with a similar mechanism to load a WASM module to run as an ADL over the fetched data. Perhaps there's some other number in there we don't recognise or know what to do with and in that case it's up to the system to ignore or error on that. But the alternative to this is going to be people creating a custom object like {"my schema for this":{"/":"bafy...."},"adl to fetch":{"/":"bafy..."}}, encoding that as dag-cbor or dag-json (or even dag-pb, there are ways! OR unixfs metadata!) and making an identity CID of that to stick in that metadata CID position. Then we have to let signalling emerge from common schemas for this kind of data.


Specifically to your second post, I'm not really sure there's much we can do about this other than:

  1. Document and educate about the downsides of encoding CIDv2s into your data and make sure users proceed knowing the risks (a tough thing to do, most devs will just go with the defaults, or the shiny new thing, and just adopt it as if it should be everywhere).
  2. Integrate tools within our libraries to manage this better - perhaps a link loader could be told to carry the metadata over as it loads through links in a graph even as it's jumping through CIDv1s (e.g. I point a CIDv2 at the Filecoin chain data, saying "hamt bitwidth = 5", knowing none of the CIDs in the Filecoin chain are CIDv2s but expecting my link loader to persist that root context as it gets all the way down, so that when it finally encounters a HAMT it can make use of it). Perhaps this could also mean that a link system and/or codec knows to strip out CIDv2 metadata as it encodes, leaving only the CIDv1 part.

@rvagg
Member

rvagg commented Sep 28, 2022

Separate topic, again: there was a suggestion in a meeting today that we allocate some time at IPFS Camp in Lisbon to have an in-person round-table about this and see if we can hash (har har) out a way forward. Getting Yatima, Lurk and DAG House in the same place would be good; finding other potential users would be good too - folks like @RangerMauve who are already thinking about ways to make use of this in conjunction with IPLD URLs and gateways.

@RangerMauve

what does CIDv2 actually need to do?

For IPLD URLs, I'd like to be able to add some extra CIDs pointing at "additional" information about how to interpret a CID, e.g. an IPLD Schema to interpret data with as part of traversal, a WASM blob CID for IPVM / autocodec use cases, etc.

I like the idea of having a table of predefined fields, and maybe having an escape for metadata that's super application-specific and doesn't need to be standardized (maybe within an identity CID or a CID to the metadata). Ideally I think the table should be used for fields that we expect different applications to reuse between each other. 😅

Generally, for cases where there's some metadata that you expect will be reused across a lot of nodes, a CID pointing at the object seems efficient enough since it'll likely be cached locally already. For example, in the IPLD Prolly Tree spec that some folks and I are working on, we're planning on linking to the config in root/leaf nodes so that we can quickly tell if two trees are using the same options.

If you're dealing with very application-specific data that isn't going to be standardized for use outside of the application, could it not be stored in the data without having to add it to the CID? I suppose having it in the CID could get rid of an extra step of indirection, but IMO you could just as easily reference an object that has the metadata + CID instead of just the CID and have your application know to handle it correctly.

@guseggert guseggert added P3 Low: Not priority right now kind/discussion Topical discussion; usually not changes to codebase labels Oct 13, 2022
@guseggert guseggert added the need/analysis Needs further analysis before proceeding label Oct 13, 2022
@rvagg
Member

rvagg commented Oct 20, 2022

Tagging @hannahhoward to discuss the potential of being able to include size in a CID.

@lidel lidel changed the title IPIP: CIDv2 - Tagged Pointers IPIP-305: CIDv2 - Tagged Pointers Oct 26, 2022
@BigLep
Contributor

BigLep commented Nov 15, 2022

2022-11-15 IPLD triage conversation: I understand @vmx that you created a related doc here. Is that right? If so, can you please link it?

@vmx
Member

vmx commented Nov 16, 2022

Currently it's just rough notes at https://hackmd.io/@vmx/HkoYAr64o. I tried to capture all the conversations I had at LabWeek. The most promising proposal is the "application context" one. That is the one I want to spend some time on, to write it up a bit better.

@BigLep
Contributor

BigLep commented Jan 3, 2023

2023-01-03 IPLD triage conversation: we're looking for some action items to move this forward or close it out. I've put it down for the next IPLD community call (https://hackmd.io/PjKSfch8QNOY4uNrnrRbDA?edit )

@vmx
Member

vmx commented Jan 17, 2023

In this IPLD sync call it was decided to close this PR. The original author of this IPIP is no longer convinced that this is the right way to do it (I talked with him in person). Also, talking with many other folks, we were in agreement that there shouldn't be a CIDv2, but something else. I wrote down a draft of a proposal named "Application Context" at https://hackmd.io/@vmx/SygxnMmso, which is based on the discussions I had with folks. There is still one more idea floating around that I need to write down; once done, I'll link it from the HackMD mentioned above.

@vmx vmx closed this Jan 17, 2023