Size limit of identity hash #130

Open
vmx opened this issue Jul 9, 2020 · 62 comments

@vmx
Member

vmx commented Jul 9, 2020

As I've been working on the Rust implementation of Multihash, it came up that the identity hash currently doesn't specify any limits. Both from an optimization perspective (this is why it came up in Rust) and from a security perspective, I think it would make sense to specify an upper bound for its size.

I personally would pick a quite low limit, similar to the lengths that current hash functions produce. So perhaps something around 64 bytes?
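
For illustration, a minimal sketch (plain Go with only the standard library, not the real go-multihash or rust-multihash API) of what enforcing such a limit at decode time could look like; the 64-byte constant is just the number floated above, not a settled value:

package mhlimit

import (
    "encoding/binary"
    "errors"
    "fmt"
)

const (
    identityCode      = 0x00 // multicodec code for the identity "hash"
    maxIdentityDigest = 64   // hypothetical limit under discussion
)

// decodeMultihash splits <code varint><length varint><digest> and rejects
// oversized identity digests.
func decodeMultihash(buf []byte) (code uint64, digest []byte, err error) {
    code, n := binary.Uvarint(buf)
    if n <= 0 {
        return 0, nil, errors.New("bad multihash code varint")
    }
    length, m := binary.Uvarint(buf[n:])
    if m <= 0 {
        return 0, nil, errors.New("bad multihash length varint")
    }
    digest = buf[n+m:]
    if uint64(len(digest)) != length {
        return 0, nil, errors.New("digest length does not match the length prefix")
    }
    if code == identityCode && length > maxIdentityDigest {
        return 0, nil, fmt.Errorf("identity digest of %d bytes exceeds the %d byte limit", length, maxIdentityDigest)
    }
    return code, digest, nil
}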

@ribasushi

ribasushi commented Jul 9, 2020

64 bytes is most definitely a non-starter, as it already won't fit the prefix/varinting of a 512-bit hash.

More generally: any data that you know won't ever be repeated is a good candidate for inlining. Also an upper limit already exists: the limit of a network block itself ( 1MiB soft, 2MiB-1 hard ).

It would make me sad if we put in an arbitrarily small limit, bounded by the ability to print out a CID. Also note that other visible system parts, e.g. dirnames, have no limits at present.

Last but not least: there are currently production systems that inline ~2k (or was it ~4k?) of data on the DHT today ( /cc @ianopolous )

@vmx
Member Author

vmx commented Jul 9, 2020

Also an upper limit already exists: the limit of a network block itself ( 1MiB soft, 2MiB-1 hard ).

That's only a current IPFS limitation. It's not a general limitation if you look at Multihash in isolation.

To me identity hashing makes sense to save space if the hashing output is bigger than the input. If you use it for something else, then to me it has the smell of a hack around some limitations that could/should be solved in a better way.

@ianopolous
Member

There are already limits in the go code. We've been using it to inline stuff in Peergos <= 4096+16 bytes (after discussing with core ipfs devs, but I can't find the issue right now @Stebalien, @whyrusleeping ? ). There are earlier discussions on this:
ipfs/kubo#4918
multiformats/cid#21

cbor-gen also seems to have just added a limit of 100 for this:
whyrusleeping/cbor-gen#24

@ribasushi

ribasushi commented Jul 9, 2020

@vmx 2MiB-1 is a libp2p limit - it permeates everything ( but you are correct: multihash in isolation is limited by uvarint alone )

It is generally correct that inlining is there to realize savings, but they come in more forms than just space:

  • absolute space saving ( the one you had in mind ): anything shorter than cid-prefix+hashbits is a win if you get to inline it ( a rough size comparison follows this list )
  • storage/index space saving: multiple-logical-blocks packaged as a single physical block by recursive inlining aids storage-page utilization/packing
  • transport space/time savings: a recursively-inlined dataset can be seen as a "mini-graphsync optimization": i.e. you ask me for node X => you automatically and invariably get its contained nodes Y Z Q and R, enjoy!
  • operation savings: if you have a construct with multiple nested known-random entries ( i.e. a set of public keys ), having each entry addressable by CID but not having to do an encode/decode pass is likely valuable in contexts I can't think of right this moment. Perhaps @ianopolous can go further into their 4096 + 16 use-case...?
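
To make the first point concrete, here is a rough size comparison (a sketch assuming CIDv1 with the raw 0x55 codec; illustrative, not normative):

package inlinesaving

import "crypto/sha256"

// hashedCID builds a CIDv1/raw/sha2-256 link: 0x01 (CIDv1), 0x55 (raw),
// 0x12 (sha2-256), 0x20 (32-byte digest), then the digest itself = 36 bytes.
func hashedCID(payload []byte) []byte {
    sum := sha256.Sum256(payload)
    return append([]byte{0x01, 0x55, 0x12, 0x20}, sum[:]...)
}

// identityCID inlines the payload instead: 0x00 is the identity code; the
// single length byte is only valid for payloads shorter than 128 bytes.
func identityCID(payload []byte) []byte {
    return append([]byte{0x01, 0x55, 0x00, byte(len(payload))}, payload...)
}

// For a 10-byte payload: len(hashedCID(p)) == 36, len(identityCID(p)) == 14.
// Inlining wins whenever the payload is shorter than cid-prefix+hashbits;
// beyond that the identity CID is strictly larger than a normal link.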

TLDR: We definitely should define a formal limit. I definitely think this limit needs to be in the high-kilobyte range, way beyond the proposed 64 bytes.

@vmx
Member Author

vmx commented Jul 9, 2020

way beyond the proposed 64 bytes.

I care more about having an actual limit, than what the limit actually is.

@ianopolous
Member

I found the discussion I was referring to:
ipfs/go-cid#88

@Stebalien
Member

We generally assume that CIDs are small and cheap to handle. CIDs are designed to be identifiers and inline CIDs are designed to allow inlining objects that are smaller than the identifier. Keeping this assumption valid is extremely important for performance and usability.

  • Allowing for large (more than 100-128 byte) CIDs makes it very difficult to reason about sizes.
  • CIDs show up in paths. Ridiculously long garbage in paths is not user friendly.
  • CIDs now show up in subdomains. The current plan here is to re-hash large CIDs in subdomains to make them fit (resolve the object, then create a new CID with a smaller hash digest) but that will impact content routing.

Last time we discussed this, we set the default limit to 32 bytes to be safe and marked the relevant options in go-ipfs as experimental.

Really, it sounds like the need here is for a way to pass around blocks with type information, right? We may need to distinguish between the two cases.

@ribasushi

The current plan here is to re-hash large CIDs in subdomains to make them fit (resolve the object, then create a new CID with a smaller hash digest) but that will impact content routing.

Afaik this part got entirely nixed, and the current implementation will not support over-long CIDs ( /cc @lidel )

@mikeal

mikeal commented Jul 9, 2020

Where/how would we enforce the limit? We’ve had very similar threads about block size limits and ended up punting any hard limits to the network and storage layer, which in this case isn’t really an option since they rarely decode the block data.

Also, just as a matter of fact, many of our existing libraries don’t handle identity multihashes in CIDs correctly and will hand the Link/CID type off to the storage layer to get it as if it were any other CID. Similarly, our Block interfaces do not all support data encoded to/from the identity multihash. The strategy for representing and considering “inline blocks” is still in the experimental phase, when it’s handled at all.

This is a good time for this discussion given that inline block support is still under development, but I want to make sure that we’re setting the right expectations about where this might land.

@Stebalien
Member

Where/how would we enforce the limit?

When decoding/validating CIDs?

Also, just as a matter of fact, many of our existing libraries don’t handle identity multihashes in CIDs correctly and will hand the Link/CID type off to the storage layer to get it as if it were any other CID. Similarly, our Block interfaces do not all support data encoded to/from the identity multihash. The strategy for representing and considering “inline blocks” is still in the experimental phase, when it’s handled at all.

Actual support in block stores is just an optimization. Inline CIDs are still usable as normal CIDs.

@vmx
Member Author

vmx commented Jul 10, 2020

We identified two cases:

Data is smaller than the hash would be

I'd call this one the original use case. There seems to be agreement that for this use case there should be a really low limit, in the bytes rather than KiB range.

Inlining blocks/blocks with type information

This is the Peergos use case. It's more of an optimization.


Now the question is: could we come up with a solution for the Peergos use case, which Peergos could upgrade to, while limiting the identity hash to a small size, as some libraries already do (and which I'd also be in favour of)?

@ianopolous
Member

It seems clear to me that there is a breaking change coming here. So, for what it's worth, I've come up with a backwards compatible solution and we're stopping using identity multihashes to inline small blocks in Peergos now. We can't migrate existing users (we don't have their keys), but I've realised that those that are on our servers are actually ok now because we are using the S3 datastore and managing pins ourselves (indeed if they're logging in through our server they're not even using ipfs at all). Any new users, new data written, or old data modified will not use identity multihashes. I think we can write a gradual migration that runs when they log in. It's painful, but the risk is too high for us.

The largest identity multihash we use now is 36 bytes, for public keys (which includes the type of key and key material all encoded as cbor). And we now enforce this as a hard limit for new data. In the future we'd like to use identity multihashes for CSIDH public keys (64 bytes + multikey header), but that is a future discussion.

@mikeal

mikeal commented Jul 10, 2020

multihash is a general purpose standard with general purpose libraries. Given how young the project is, we should assume that the use cases we currently understand are not a complete set and we need to make sure we stay open and accessible to being used for things in the future that we haven’t even thought of yet.

With that in mind, I don’t think that we should:

  • Create and enforce limits based on our own opinions about what is “good practice.”
  • Adopt and enforce limits on behalf of specific use cases.

If there’s a universal reason to have a limit on the size of a multihash that we’re confident is always going to be true, then we should adopt it, but that’s not at all what I’m seeing.

If, as we already know, IPFS wants to use CIDs for subdomains and therefore needs to enforce a size limit on CIDs, which in effect limits the size of a multihash, that’s fine, but that’s IPFS’s decision to make and their limit to enforce. That doesn’t belong in the core of multihash because it’s not universally representative of all the use cases someone might build on multihash.

@vmx
Member Author

vmx commented Jul 10, 2020

With that in mind, I don’t think that we should:

  • Create and enforce limits based on our own opinions about what is “good practice.”

I think we should do exactly that. This will prevent systems from randomly blowing up when some implementations impose arbitrary limitations. The nice thing about having a limit in multihash is that you can highly optimize for it. You would always know the upper bound of the supported hashes, and hence e.g. do things with stack allocations only. This is not possible if you want to support the identity hash with maximum compatibility, which would mean 8 EiB.
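
As a sketch of the kind of optimization I mean (Go here, though the same applies to the Rust implementation; the 64-byte bound is hypothetical):

package boundeddigest

import "errors"

const maxDigestSize = 64 // hypothetical spec-level upper bound

// With a known upper bound the digest can live in a fixed-size array, so a
// Multihash value never needs a heap allocation of unpredictable size.
// Without a bound, every decoder has to be ready for arbitrarily large input.
type Multihash struct {
    Code   uint64
    Size   uint8
    Digest [maxDigestSize]byte // valid bytes are Digest[:Size]
}

func newMultihash(code uint64, digest []byte) (Multihash, error) {
    if len(digest) > maxDigestSize {
        return Multihash{}, errors.New("digest exceeds the spec maximum")
    }
    mh := Multihash{Code: code, Size: uint8(len(digest))}
    copy(mh.Digest[:], digest)
    return mh, nil
}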

@mikeal

mikeal commented Jul 10, 2020

The nice thing about having a limit in multihash is that you can highly optimize for it. You would always know the upper bound of the supported hashes, and hence e.g. do things with stack allocations only. This is not possible if you want to support the identity hash with maximum compatibility, which would mean 8 EiB.

Then don’t “support the Identity Hash with maximum compatibility” ;)

Users and implementations are free to make domain specific decisions about these limits, the right decision for one user will not be the same for another. It’s not the job of the underlying primitive to make these decisions on your behalf because we don’t know what each user’s requirements are.

Look at the block limit: to my knowledge only one transport has a real block limit, and yet pretty much every user imposes block limits at half the current transport limit because we called it out as a good practice. It’s not a hard requirement in the spec and it’s not enforced by our codec libraries, but it’s a functioning limit everywhere that it matters.

I’m not saying we shouldn’t define good practices, and even document what we think is a reasonable target limit for multihash, but we shouldn’t impose that limit in these libraries at that layer or call it out in the specification as a hard requirement.

If @ianopolous wants to have big multihashes in his CIDs, he shouldn’t have an issue at the multihash layer, even if he will have issues at the IPFS layer. In the same way that I can create 5MB blocks for a one-off use case knowing that if it ever needs to be used in Bitswap it’s going to break.

@mikeal

mikeal commented Jul 10, 2020

Now the question is. Could we come up with a solution for the Peergos use case, which Peergos could upgrade to, while limiting the identity hash to a small size as some libraries already do (and also I'd be in favour of).

I struggle to see how these inline use cases would exist in dag-cbor and newer codecs. From what I can tell, this looks like a workaround for some limits in dag-pb or perhaps in unixfsv1 (I haven’t gone deep enough to know for sure).

You can “inline” node data into the block using any codec that supports the full IPLD data model without hacking it into the CID. I can’t see the utility here other than “we forgot to make this part of the data structure a union” and it seems like the right thing to do there would be to fix the data structure to support that because it’s a lot more complex to deal with data that has been inlined into the multihash. I understand that in the case of dag-pb we may not be able to change the data structures, but that doesn’t mean we should port this practice over to users that have access to the complete IPLD Data Model.

It’s not that inlining data isn’t a common and necessary feature, it is, that’s why we fully considered it in dag-cbor and in the IPLD Data Model and have a compelling feature set. If this pattern is common enough we could even consider adding syntax to IPLD Schemas to make kinded unions on links easier, similar to how we have syntactic affordances for making links in general easier.

But, across a lot of our code, data inlined into the multihash throws a wrench in our layer model and is difficult to support across different Block, Link, and storage interfaces. Most code thinks of a CID as a key and its data living somewhere that it can retrieve by that key. If you put the data in the key, the representational pairing of [ key, value ] is lost and there’s not a very clean way to maintain the interfaces without pushing this to users (which is what happens currently, if you put data in the multihash you’re going to be pulling it out and working with it very manually).

@ribasushi

ribasushi commented Jul 10, 2020

If you put the data in the key, the representational pairing of [ key, value ] is lost

why do you say that? Nothing in e.g. ipld-prime land would change:

  • you encounter a link with CID \x01\x71\x00\x2A{{ 42 bytes of cbor }}
  • you go to the blockstore link loader and ask for that
  • the blockstore link loader turns around instantly and gives you back {{ 42 bytes of cbor }}
  • profit
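
In code, that flow could look like this minimal Go sketch (with a made-up Blockstore interface rather than any real go-ipfs or ipld-prime type; varint error checks elided):

package inlineblocks

import (
    "encoding/binary"
    "errors"
)

const identityCode = 0x00

type Blockstore interface {
    Get(cid []byte) ([]byte, error)
}

type inlineAwareStore struct {
    backing Blockstore
}

// Get parses <version><codec><mh-code><mh-length><digest> and, when the
// multihash is the identity function, hands back the digest as the block
// without touching the backing store.
func (s *inlineAwareStore) Get(cid []byte) ([]byte, error) {
    version, n := binary.Uvarint(cid)
    if n <= 0 || version != 1 {
        return nil, errors.New("expected a binary CIDv1")
    }
    rest := cid[n:]
    _, n = binary.Uvarint(rest) // skip the IPLD codec
    rest = rest[n:]
    mhCode, n := binary.Uvarint(rest)
    rest = rest[n:]
    length, n := binary.Uvarint(rest)
    digest := rest[n:]
    if uint64(len(digest)) != length {
        return nil, errors.New("malformed CID")
    }
    if mhCode == identityCode {
        return digest, nil // the "block" is already in the CID; no lookup needed
    }
    return s.backing.Get(cid)
}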

@ianopolous
Member

ianopolous commented Jul 10, 2020

I struggle to see how these inline use cases would exist in dag-cbor and newer codecs. From what I can tell, this looks like a workaround for some limits in dag-pb or perhaps in unixfsv1 (I haven’t gone deep enough to know for sure).

Everything we do is dag-cbor - this has nothing to do with protobuf or unixfs.

It is much, much more elegant to do it our way (although elegance wasn't our motivation, speed was) because then the same object class always maps to the same ipld structure. The way I've worked around it is to now have two distinct cbor ipld encodings for the exact same type (class) of object, and handle both types explicitly in the deserialization. So the same type of object now has two totally different ipld structures.

it’s a lot more complex to deal with data that has been inlined into the multihash.

It was trivial to support this for us (4 lines of code globally). There's nothing "manual" to it.

@mikeal

mikeal commented Jul 10, 2020

link loader

That’s specific to Go, where the link loader is an abstraction between the storage layer and the decoded node layer. We don’t have that in every language, often the node layer just talks directly to the storage layer, which means every storage API needs to handle this or every line that asks for data by CID needs to handle this.

Also, @warpfork will need to weigh in, but I recall him mentioning that there are plenty of things in go-ipld-prime that won’t work well when inlining data this way.

It is much much more elegant

The solution we spent considerable time working through to this problem is unions (mostly kinded unions using Link as a kind). It’s a core feature of IPLD Schemas and translates nicely into every programming language and all the abstractions we’ve built.

It’s problematic to have multiple approaches to inlining data, and a kinded union provides a much cleaner approach that keeps the type differences clear to everyone. It sounds like you actually want to blur the line a bit on the type differences, so I can see how that approach would be more attractive, but as we build out generic libraries it’s rather difficult to have a single type mean very different things.

That said, we’re not going to break or disallow anything that is valid CID/multihash, we just may not have a very nice interface for you to use when you inline data this way, which you probably don’t care about since you have your own libraries ;) And as I’ve already stated, I’m rather opposed to setting a hard limit on multihash size in the specs or core implementations. Some libraries and consumers may set limits you’ll have to contend with and I suspect languages or libraries that want to optimize memory allocations will set a configurable limit, none of which are an issue if you were to take the kinded union approach instead.

@ribasushi

ribasushi commented Jul 10, 2020

It’s problematic to have multiple approaches to inlining data and a kinded union ...

I think this is where the disconnect is. Identity CIDs operate on a layer below where a "kinded union" would exist, they are strictly in "codec-land".

To put it differently: from link-traversal perspective there is no practical difference between:

  • a 4096-byte identity CID
  • a 32768-bit output from keccak's shake functions

We currently support both. The proposal is to limit only one of them, on account of one of them being special. How do these 2 examples differ?

@mikeal

mikeal commented Jul 10, 2020

I think this is where the disconnect is. Identity CIDs operate on a layer below where a "kinded union" would exist, they are strictly in "codec-land".

Once you recognize that links in a node graph are transparently traversed (the link is resolved to a node that replaces the link representation to become the node representation for that property in the parent node), they are functionally equivalent. Both put the node data in the same place, stored in the same block.

Conceptually, this never exists in the decoded node graph:

ParentNode -> Link -> ChildNode

Instead, this is what happens:

// before resolution
ParentNode -> Link
// after resolution
ParentNode -> ChildNode

You can observe this in our pathing, where named properties that are links get resolved to their decoded node value. There’s actually no way to return a link from a fully resolved path.

let value = 1234
let link = Link( value )
{ property: link }

If you resolve the path /property of this block you’ll get 1234, there’s no pathable reference to the link itself.

@vmx
Member Author

vmx commented Jul 13, 2020

Some libraries and consumers may set limits you’ll have to contend with and I suspect languages or libraries that want to optimize memory allocations will set a configurable limit.

That's exactly the issue when we don't specify a limit. Your application would work on one implementation, but not by default on some other. Finding out that one of your identity hashes is too long sounds like a very hard-to-find bug if you are not aware that there might be a limit.

The option might be configurable, but perhaps not at runtime, only at compile time.


Having a limit makes building systems that work everywhere easier, not having a limit makes it harder with little benefit.

Storing data in a Multihash sounds wrong to me; that is not what Multihash means to me. Though it's kind of a neat hack, so if we want to support large data (still with a limit, but a high one) stored within a Multihash, I suggest we introduce a new codec for that. This way codecs can still support the "the hash of my data is bigger than the data itself" use case, while not being forced to also support the "I store data in my multihash" use case.

This way the error reporting will also be clear, as you now see that your codec is not supported instead of having it fail in weird ways.

@warpfork

warpfork commented Jul 13, 2020

I think a lot of the important things have already been said here, but I've been called out, so I guess I feel I ought to weigh in on the record, heh. I'll mostly just re-highlight things that have already been said that I agree with and want to boost, though:

  • It is true that we have typically eschewed hard limits in IPLD, while acknowledging that they exist in systems we're likely to be used together with (such as block size limits from IPFS / libp2p), and advising that people probably want to stay within those limits unless they're confident their use-case means they aren't worried about this. I think we have no regrets on this approach so far!

  • It is simultaneously true that we may want to note a size boundary recommendation, per the same logic as the previous point.

  • I'd probably vote for having length limits be a parameter of which multihash we're talking about.

    • FWIW: I've actually always been a little surprised we have length parameters on all of the multihashes to begin with. I see why it's a parameter for e.g. blake2 variants; it was originally quite a shock to me the first time I saw it was also present for, say, plain ol' sha256.
  • I strongly agree with @Stebalien 's observations: "We generally assume that CIDs are small and cheap to handle. Keeping this assumption valid is extremely important for performance and usability. Allowing for large (more than 100-128 byte) CIDs makes it very difficult to reason about sizes." Yep; yep; and yep. I am not a fan of large values appearing in CIDs.

  • Inline CIDs... okay. Let's break this down into a few sub-bullet points:

    • I don't think inline CIDs were a good idea, and I don't think we'll be encouraging anyone to use them in a year's time. (I don't think I'd encourage anyone to use them now, in fact; and I might place a bet on the table that we'll be actively and overtly regretting them in a year's time.) The following points are why:
    • We have a mechanism for talking about a choice between "inline" data and links now! In the IPLD Data Model, you can do this freely; it's so innately up-to-you that it's barely possible to talk about a world in which it's... not. (This is what @mikeal is getting at in some of his comments. Inline CIDs might've made sense in some applications based around dag-pb, I think, maybe? But that world is... something we should regard as fading into the past very quickly: the IPLD Data Model is much more expressive than dag-pb.)
    • We have a well documented and explicitly tool-supported way to choose between inline and linked data, as well: IPLD Schemas have copious support for unions, and various forms of indicating them: including some which simply differentiate based on whether the serial data is a link or some other Data Model kind. (These are the same concepts as you can use without the Schema layer; I highlight it in terms of the Schema system only to emphasize how very, very supported it is.)
    • These mechanisms for choosing whether you're using links or "inlining" the data (by just... not putting it into a "link" format at all) work without causing additional serialization calls just to put the data into bytes for an "inline CID", which is... superior.
    • These mechanisms for choosing whether you're using links or "inlining" the data are exposed very clearly at the library layer: you just... manipulate your data using the standard Data Model semantics, which resemble an AST. These continue to work and compose even when you're doing deep traversals, etc.
    • Contrariwise: we do not have particularly good and clear ways to do high level operations on large graphs of data, whilst constantly deciding which links are going to be encoded as inline or not. It's possible... but I wouldn't say it's smooth, or a corner that we get a better developer experience by regularly asking people to shine a light into that crevasse.
    • Given all the other caveats about inline CIDs discussed above (e.g., big sizes cause various major usability questionmarks)... really, there's just... I struggle to think of much good at all that comes from inline CIDs. We have clear alternatives; I can't imagine we won't grow to prefer them.
    • n.b. I don't want this screed to be mistaken for a claim we should drop support for inline CIDs either; that's a stronger case than I'm willing to make. But I think it's also important to make remarks on how much energy we want to put behind supporting them nicely. IMO: a very limited amount; and I definitely don't think we should let them have predominating influence on any other design choices we wrangle with.
  • "There’s actually no way to return a link from a fully resolved path." -> okay, this is true, but I do actually regard that as probably a design bug, fwiw :) I'd like to design our way out of this. The blind spot has hurt us surprisingly little so far, but it does bother me.

    • The rest of the arguments made in that comment still stand though! Even though I want to design our way out of this blind spot, my current suspicion is that we'll make "advanced pathing" variants that allow introspecting links, but they'll not be the default path we lead users down in most cases.

@ianopolous -- I'd be happy to talk to you more synchronously about this if you'd like, but in essence,

The way I've worked around it is to now have two distinct cbor ipld encodings for the exact same type (class) of object, and handle both types explicitly in the deserialization. So the same type of object now has two totally different ipld structures.

... that sounds less like a "workaround" and more like "exactly the right thing" to me :)

Maybe there's some different way to organize the type definitions in your language of choice that would make it more natural? Dunno; I'd be willing to look at it with you though, if you'd like. I'll also say: goodness knows golang hasn't been making it exactly easy on me to represent unions either! But it's been a logically sound path to pursue, even when my host programming language hasn't made it frictionless. So far, every time push has come to shove, I've been very happy with the outcomes stemming from our pursuit of unions.

@warpfork

(Perhaps that comment would be easier to read if we introduced some term other than "inline" for when we put data in the same block rather than a separate block -- I used it to describe that general practice in the same comment as discussing "inline CIDs", and that's probably confusing. Forgive me, reader. A better phrasing has not, at this moment, yet occurred to me.)

@warpfork

warpfork commented Jul 13, 2020

To follow up a little more concretely on that allusion to "we have a way to talk about links vs inline embeded data now" --

This is a schema snippet that we could use to describe this common scenario:

type ThingSomewhere struct {
    foo String
    bar Bytes
}

type ThingHereOrNot union {
    | ThingSomewhere map
    | &ThingSomewhere link
} representation kinded

type EnclosingFwoop struct {
    couldBeEmbededOrBeLink ThingHereOrNot
    otherData String
}

A block containing one object matching the EnclosingFwoop type could have either zero or one links in it, and is still described by this schema in either case, and its transitive graph (if it does contain a link -- or just the block itself, if it doesn't) contains one ThingSomewhere.
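
For instance (hypothetical dag-json renderings; the string values and the truncated CID are placeholders), the same EnclosingFwoop field can carry either form and still match that schema:

{
  "otherData": "hello",
  "couldBeEmbededOrBeLink": {
    "foo": "abc",
    "bar": { "/": { "bytes": "AAEC" } }
  }
}

{
  "otherData": "hello",
  "couldBeEmbededOrBeLink": { "/": "bafyreib..." }
}

In the first form the ThingSomewhere is embedded as a plain map; in the second it sits behind an ordinary link. No identity CIDs are needed for either.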

At no point were inline CIDs harmed (or even used) in the description of this data; and yet, we have a choice over whether or not the data is split into two blocks.

I find that this is a very straightforward way to describe this situation: and it works fine with arbitrarily complex structures of data; it works purely in ways that are easy to describe in terms of the Data Model (e.g., without needing to discuss using different variants of CIDs); and because of this simplicity, I think this is generally the sort of approach I would recommend essentially all new code to take in preference to inline CIDs, if at all possible.

I've used the schema syntax here only to clarify and describe. It is not necessary to use schemas to do this; one can simply construct data and walk over the data model according to this convention.

I don't actually know what reasons there would be to prefer using inline CIDs over using this much simpler model of "embed if you want". If there are some very specific situations that really require using a CID for ${external opaque reason}, maybe that's something we should document in a short list of known situations? My suspicion is that the list is going to be very short and have the general feeling of enumerating exceptions rather than the rule.

@ianopolous
Member

I can explain our case in detail if it helps. Note that we have a work around/"exactly the right thing to do" as mentioned above.

One of our fundamental objects has a field which is ciphertext, so a structureless byte[], which is limited to 5 MiB. This thing can be either a directory or a file, and indeed we explicitly hide this from the ipld layer (you have to decrypt to decide which). We represent this as a class with a field which is a list of cids. Whenever we want the actual ciphertext we just pass the list of cids to the datastore and get back the results (identity hash or not). However, as I mentioned above, this thing can represent a directory, and this means that when we are traversing a path we must retrieve many of these objects, decrypt the ciphertext and recurse. The critical point in our model is that we aren't assuming the datastore is local, normally it is on a remote server. This means that inlining the directory (or small file) data is critical to speed for us because it results in many fewer network round trips/ DHT retrievals.

I'm fine with not using identity multihashes here, as we're about to migrate away from them, but it does result in more complicated code, and I wish there had been a limit or an indication that these were meant to be < 100 bytes when this was released in go-ipfs.

On the general topic of having a limit. I am very strongly pro having a well defined limit. The reason being that without such a limit we have a system with a primitive that is basically undefined. If some system imposes limits then the end result is a fragmented ecosystem of applications and language implementations that can't talk to each other.

@ribasushi

@ianopolous so to make sure I understood you right: basically your use case is point 3 here: #130 (comment) ?
Your "bonus" of using inlined CIDs is the ability to not care for a predefined "schema" of the structure, so that "everything is a block" to your unwrappers/decryptors?

Or in other words: you use inline CIDs to logically separate the "transport/decrypt/low-level-decode" codepath from the "semantic high-level decode" one?

@ianopolous
Member

@ianopolous so to make sure I understood you right: basically your use case is point 3 here: #130 (comment) ?
Your "bonus" of using inlined CIDs is the ability to not care for a predefined "schema" of the structure, so that "everything is a block" to your unwrappers/decryptors?

Or in other words: you use inline CIDs to logically separate the "transport/decrypt/low-level-decode" codepath from the "semantic high-level decode" one?

@ribasushi Yep that's right.

I should also add that the reason it is 4096 + 16 is that we pad all plaintext to a multiple of 4096 before encryption to protect the metadata around size. (And the 16 is the encryption overhead)

@mikeal

mikeal commented Jul 13, 2020

Your application would work on one implementation, but not by default on some other.

This is true of pretty much every protocol. TCP, UDP, HTTP, etc, there’s no size limit in the protocol specification for the total data transferred, but every service provider and implementation sets one. These limits don’t seem to negate the benefits of agreeing on common protocols and clients learn to live within reasonable limits the ecosystem of service providers have set.

As an example: I have a script that does GraphQL queries against GitHub’s service, and if the request takes too long the gateway kills the connection even though the query was well within the rate limits that GraphQL and even their HTTP service set. Service limitations are application and provider specific, and they are applied to all the protocols you touch; we can’t enforce them for everyone or even hypothesize about what all the use cases are. For a lot of people, setting a block size limit of 1MB solves any concerns they might have about large CIDs as a side effect. For others, maybe not.

I’m very interested in recommending a reasonable size limit and would expect many consumers to adopt it (similar to how we handled block size limits) but to set a hard limit in the standard is too much for me.

@mikeal

mikeal commented Jul 14, 2020

However, I think it's a big problem to have something that is effectively a spec (i.e. the IPFS blake3 multihash is at most 1KiB) but not have it expressed explicitly via the codec.

A surprising number of hash functions are of variable size so it’s not really practical to do this.

Something @ribasushi brought up a few months back is that we need to put a size limit in our code for pretty much every hash function. Since these recommendations vary per hash function, we may want to capture this in another table, or expand the multicodec table, because the expected/recommended max size for most hashing functions doesn’t vary by programming language.

We haven’t implemented this yet, but we intend to.

@aschmahmann

aschmahmann commented Jul 14, 2020

A surprising number of hash functions are of variable size so it’s not really practical to do this.

What part is impractical about doing this? We may have a few (i.e. 1-4) hash functions, but the larger their sizes are the more prefix bits we can give them. Our key space is effectively unlimited if we're not focusing on giving out prefixes that are 1-2 bytes.

We gave away 100s of spots to other multihashes in the past, many of which are probably unused (looking at you, blake2s-72). What's giving out a few per hash function, where the prefix length assigned increases with minimum multihash size?

we need to put a size limit in our code for pretty much every hash function

If that needs to exist in code and we want code from different projects to be compatible then the code requires a spec

Since these recommendations vary per hash function ... expected/recommended max size for most hashing functions doesn’t vary by programming language

Then let's just assign the recommendations to a codec and we're done.

expand the multicodec table

How is this different than my proposal?

@mikeal

mikeal commented Jul 14, 2020

Every table entry has a cost as it represents another barrier to compatibility between implementations, as each implementation must be configured with a hashing function that matches the identifier. If a hashing function has a variable length, we can maximize the compatibility between implementations by leaving an affordance for the variability in length associated with only that one table entry and hashing function, making far more implementations compatible with it.

We cannot expect every multiformats implementation to have support for every hash function. We also cannot arbitrarily limit the number of new hash function entries in order to try and reduce proliferation because people who are using obscure hash functions have grounded reasons to do so and will likely sacrifice multihash compatibility if necessary. The only thing we can do is allow for optionality where we can, which is in the length. As far as I can tell that’s the only place we can, so we should.

@mikeal

mikeal commented Jul 14, 2020

How is this different than my proposal?

I’m recommending that we add a column to the table to capture recommended max sizes. What I think you’re recommending is adding new entries for each hash+length.

@aschmahmann

aschmahmann commented Jul 14, 2020

If a hashing function has a variable length we can maximize the compatibility between implementations by leaving an affordance for the variability in length associated with only that one table entry and hashing function, making far more implementations compatible with it.

How?

If you're talking about library compatibility, such as between go-multihash and js-multihash then what's the big deal? Either way we're going to end up with code like this to take into account the fact that we already allocate 100 blake hashes. Having libraries internally have some function like addCodecWithSizes(codec, []size) is IMO not a big deal, especially since we effectively require it already due to blake2 and skein.

If you're talking about application layer compatibility then IMO having a specified length instead of an apprehensive documentation note is much better. Without it the probability that two multihash based applications end up incompatible with subtle bugs seems quite high.

There's another benefit to using the multicodec table. When a new user shows up and says "I see you support the identity multihash up until 512 bits, but I'd like 10k please give me a codec" we can say "sure we're happy to assign a codec, but there's a reason nobody's using it so far do you want to discuss your use case".

@ribasushi

@mikeal @vmx I think what @aschmahmann proposed is actually quite practical ( even though I dislike it for a reason detailed below ), if we attach sizing to the multiformats table position. Just like Unicode has predefined "planes", we could ( without breaking too much ) define the following:

  • Any multihash id that has a length of 4 bytes or less when encoded as a varint, implies that the "hash/payload size" is <= 127 ( handwave )
  • Any multihash id that has a length of 5 bytes when encoded as a varint implies that the "hash/payload size" is <= 65535
  • ... continue series until 9 bytes and a gigabyte of payload or whatever

I really dislike this because we are limiting stuff that was previously open, but if this will make all sides moderately happy: I think it's a compromise worth considering.
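
A sketch of that "planes" idea (the thresholds are the handwaved ones from the list above, not anything specified anywhere):

package mhplanes

// impliedMaxPayload maps how many bytes the multihash code occupies as a
// varint to the maximum hash/payload size that table position would imply.
func impliedMaxPayload(codeVarintLen int) uint64 {
    switch {
    case codeVarintLen <= 4:
        return 127
    case codeVarintLen == 5:
        return 65535
    default:
        // ...continue the series up to 9 varint bytes and a gigabyte of
        // payload, or wherever the table stewards decide to stop.
        return 1 << 30
    }
}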

What I take issue with however is:

we can say "sure we're happy to assign a codec, but there's a reason nobody's using it so far do you want to discuss your use case".

This implies too strong "table ownership" dynamics. We are stewards at best. If a user wants to take the cost of a large CID, and the penalty of a large multihash code: they should get one, no questions asked ( asking out of curiosity: sure, but not in "justify yourself" terms ).

Unless the conversation is shifting from "we want to give implementers reasonable assumptions to work with" to the more authoritarian "I think doing X is bad, and I don't want to see anyone do it using my table". If we are having the latter conversation: we already lost our way.

@aschmahmann

Any multihash id that has a length of X bytes ... implies that the "hash/payload size" is <= Y

I don't know if we really need to define that as anything more strongly than a general policy (as opposed to a rule we could encode in a lang-multihash library). However, the general principle of leveraging the size of the multicodec table to allow specifying as many things as we want is indeed what I was getting at.

This implies too strong "table ownership" dynamics.

Fair enough. I agree we should assign codecs if people ask for them without needing a "reason" (as long as they're not doing things like asking for 1000 low bit codecs).

However, as people who are involved in the project, it may be a useful heads-up to let a project know, while they can happily use a new codec, what some of the tradeoffs involved might be (especially if it's as simple a change as increasing the hash length).

@vmx
Member Author

vmx commented Jul 15, 2020

I’m recommending that we add a column to the table to capture recommended max sizes.

@mikeal I'm against this. It would lead to discussions like this one: which recommended size would you pick for the identity hash? It might also make implementations more difficult, as I would then try to meet those recommendations. If we just say there are limits, I can keep it simple and e.g. limit all my hashes to a single maximum value. That's easy to explain and easy to reason about.

@mikeal

mikeal commented Jul 15, 2020

@vmx we do not want to impose a cap for SHA2-256 of anything other than 256, and 512 for SHA2-512, and so on. Some hash functions have a known fixed length that should be imposed and some are variable, that’s why this needs to be another column, because it’s a bigger problem than just this generic max size.

@whyrusleeping
Member

Wow, long thread. I just want to add that the 100 byte limit I imposed on ScanLinks in cbor-gen (here) is fairly arbitrary, and aimed at protecting certain performance critical code from overallocations (note: the const is only used inside of ScanLinks, the rest of the codebase imposes a limit of 512 on cids, another fairly arbitrary limit to help protect against DoS).

I'm happy to change these limits (or make them configurable per use case) as people see fit.

@vmx
Member Author

vmx commented Jul 16, 2020

some are variable

Yes, I know. Though I guess it should follow the same logic as the identity hash: we cannot really make good judgements about those limits. In case we want per-hash recommendations (I don't), I wouldn't add another column (as it also doesn't apply to non-Multihash Multicodecs), but just add it as comments.

@ribasushi

Re-pinging all participants, especially with the newly attained understanding of where identity CIDs really shine.
What should we do here?

@vmx
Member Author

vmx commented Jun 21, 2021

@ribasushi Can you please elaborate on where identity CIDs really shine?

Side-note: I'm still not sure how far the concept of CIDs should be stretched. I like the simplicity of having an identifier for content-addressed things. To me, "identity CIDs" (I'd call them "inline CIDs") are a different concept. They make use of multiformats (which is great), but are they really CIDs anymore? In order to make that work, we not only need to stretch the concept of CIDs (by not addressing the content, but inlining it), but also that of Multihashes. They are (according to the current spec) about "cryptographic hash functions", which the identity function certainly isn't (e.g. it misses pre-image resistance). So perhaps there should be a new multicodec called "IID", an "Inline Identifier", which is just: <inline-identifier-multicodec><ipld-codec><length><data>. And perhaps there is then a "UID", the "universal identifier", which understands the multicodecs CIDv0, CIDv1 and IIDv1.
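
A sketch of how such an IID could be laid out (Go, standard library only; the IID multicodec value would be whatever the table eventually assigned, nothing like it exists today):

package iid

import "encoding/binary"

// encodeIID lays out <inline-identifier-multicodec><ipld-codec><length><data>.
func encodeIID(iidCode, ipldCodec uint64, data []byte) []byte {
    out := binary.AppendUvarint(nil, iidCode)
    out = binary.AppendUvarint(out, ipldCodec)
    out = binary.AppendUvarint(out, uint64(len(data)))
    return append(out, data...)
}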

@mikeal

mikeal commented Jun 21, 2021

We have recommendations for block size limits but ultimately it’s up to different protocol implementations to enforce them. I don’t see a compelling reason to recommend or try to enforce any limits beyond that. Identity CIDs live in blocks and can accumulate quickly to exhaust block limits. Given that we have no real enforcement mechanism let’s just leave it as is.

@Winterhuman

Winterhuman commented Aug 31, 2022

Adding to the discussion, why limit the size of identity multihashes when you could limit the size of the things that contain identity multihashes? Here's my thinking:

  • DNS has its own limits, both for subdomains and paths, which restricts the use of identity multihashes for gateway resolution.
  • Identity multihashes can be within DAG object links; however, there's no real way to "chunk" a CID, so the max size of an identity multihash is effectively the chunk size minus the other data in the DAG object (though perhaps even this is too large).
  • And lastly, the Peer, Provider, and IPNS records should be given their own size limits, which then restricts identity multihashes to the record size minus the other data in the record.

Beyond DNS, DAGs, and IPFS records, identity multihashes don't see much use, and from there the size restrictions should be determined by the developer who intends to use them outside this typical set of use-cases.

As for developers making multihash parsers, the spec should include a minimum identity multihash size they must support (maybe make it equal to the largest non-identity multihash available?), however, it should be up to the developer to decide on an upper limit depending on the performance/security considerations that are being taken for the particular implementation.

What do you think?

@rvagg
Member

rvagg commented Sep 1, 2022

Hey, I forgot about this discussion (thanks for reviving it @Winterhuman) but I've had my head in identity CIDs of late so this is relevant.

We're dealing with some Filecoin retrieval issues involving identity CIDs which have been made by various tooling, most notably Lotus itself when creating UnixFS data for you with an import! https://github.com/filecoin-project/lotus/blob/088bf56f2a9a9ad6ae031b9af867cefec8ed4245/lib/unixfs/filestore.go#L31-L42

I've seen this code copied in a couple of places, so we must have a cultural meme that's stuck with some folks that identity CIDs are a good idea when the blocks are <= 126 bytes. So this is happening, in the wild, and any limit should at least take that into account.
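
Reconstructed from that description (not copied from Lotus), the decision those tools are effectively making looks something like this, where the threshold and the single-byte codec are assumptions:

package inlinedecision

import "crypto/sha256"

const inlineThreshold = 126 // bytes, per the behaviour described above

// cidFor returns a binary CIDv1: small blocks become identity CIDs that carry
// the block itself, anything larger gets a real sha2-256 digest.
// codec must be a single-byte multicodec (e.g. 0x70 dag-pb, 0x55 raw).
func cidFor(block []byte, codec byte) []byte {
    if len(block) <= inlineThreshold {
        return append([]byte{0x01, codec, 0x00, byte(len(block))}, block...)
    }
    sum := sha256.Sum256(block)
    return append([]byte{0x01, codec, 0x12, 0x20}, sum[:]...)
}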

The retrieval problem with Filecoin is interesting because it involves DAG root CIDs that are identities and are skipped by various parts of the storage and indexing layer because they are identity CIDs. But because they're overlooked, when a user shows up wanting to retrieve a DAG that starts as an identity CID (typically with multiple links in it), the retrieval fails because it can't find a stored piece with that CID in it.

So a top-layer solution (there are other solutions to be explored focused on indexing, but for now we just want to make these retrievals work) is to accept identity CIDs and handle them as if they are a request for a piece containing all of the links in that CID and then start a retrieval based on that from a matching piece.

But in the process of implementing that we have to make some decisions about limits, because we're not just dealing with the <=126 byte case, we're essentially opening up another retrieval mode where a user can craft their own identity CID containing lots of links and be requesting that a storage provider find a piece that happens to have all of those links in it. Having it open ended isn't really ideal because that could be a lot of work and from a client perspective just making a query is cheap and we don't want to make it expensive for storage providers.

I've come up with some limits; they're kind of arbitrary, but they at least feel like a compromise between "don't make me do too much work" and "this could be a useful mode for multi-DAG or just multi-block retrieval": filecoin-project/go-fil-markets#747

  • 2048 bytes maximum
  • 32 links maximum

You can pack 32 average sized CIDv1s into a dag-cbor identity CID for around 1320 bytes.
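
Rough arithmetic behind that figure (assuming 36-byte sha2-256 CIDv1 links and dag-cbor's encoding of a link as CBOR tag 42 over the 0x00-prefixed binary CID):

per link:   2 (tag 42) + 2 (byte-string header) + 1 (0x00 prefix) + 36 (CID) = 41 bytes
32 links:   32 × 41 + 2 (CBOR array header)                                  = 1314 bytes
as a CID:   1314 + 5 (CIDv1 + dag-cbor + identity code + 2-byte length)      ≈ 1319 bytes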

Too much? Too little? 🤷 But this is a specific path, so the limits here don't necessarily reflect what a reasonable limit might be for other uses.

@Winterhuman

@rvagg It's definitely a use-case to keep in mind, and it also helps demonstrate one of my points.

It would be much easier to just give a minimum max size of identity multihash that all multihash parsers must support than giving a definitive maximum size. For instance, your use-case is using identity multihashes for DAG roots, and so it has a large limit, however, something like Kubo would almost certainly support only up to the largest non-identity multihash to mitigate attack vectors.

Just to discuss your idea further: you want to impose a 32-link maximum per identity CID; however, what would stop people from using identity CIDs as links, with further links in them, to go past the 32-link limit? I guess the 2048-byte maximum would prevent it going further, but it's something I wanted to bring up.

@rvagg
Member

rvagg commented Sep 2, 2022

Just to discuss your idea further, you want to impose a 32 link maximum per identity CID, however, what would stop people from using identity CIDs as links, with further links in them

Well, I added support for recursive identity CIDs here: filecoin-project/go-fil-markets@ffdf335

Because the recursive CID has to be inside the parent identity CID, you hit the byte limit pretty quickly, and the total 32-link max is taken over the parent plus the recursives it contains. Any identity CID that an identity CID contains must be included in it, so the case is covered fairly neatly here.

But to be clear - recursive identity CIDs are kind of insane, and even supporting them seems a little insane.

There's some additional weirdness when it comes to Filecoin and identity CIDs that makes it a tricky topic, mainly focused around the relationship between a DAG, the blocks in that DAG and the "piece" it's stored in. When you have an opaque blockstore that just returns blocks to you when you ask for them, a lot of those problems go away. But with Filecoin there are other factors that come into play, so it's not opaque: do I have to unseal the piece to get you this block/DAG? Do I have to charge you to send you this block/DAG? And identity CIDs essentially become identifiers: even if Filecoin were to just treat them as free retrievals, (a) do we want to force storage providers to answer random queries for "give me this identity CID"? and (b) what about DAGs that hang off such identity CIDs when you want to do a graph-based retrieval, or when you're using an identity CID to identify some piece that you want to do a complete piece retrieval of?

Ugh. It's a mess and leaves me believing identity CIDs were a mistake.

@ribasushi

But to be clear - recursive identity CIDs are kind of insane

This is objectively incorrect: any competently written dag assembler can naturally produce recursively-stacked identity CIDs: the input into the "decision function" of whether to wrap an identity is nothing but the size of the "block" based on the immediate linklist.

leaves me believing identity CIDs were a mistake.

None of your current troubles are specific to identity CIDs as a concept: everything stems from the fact that folks were/are trying to be clever about handling them in a special way through the stack. That is: identity CIDs are not special in any way. Treating them as such outside of transport is the actual grave mistake we should stop perpetuating.

@Winterhuman

Winterhuman commented Sep 3, 2022

I do have one question about inline CIDs. What's the difference between setting a link to bafkqai vs having a "bytes": section in the DAG object? For example:

"Links" : [
   {
      "Hash" : {
         "/" : "bafkqailumvzxi2lom4qgm2lmmvsho2dbmrug6ylxnfugicq"
      },
      "Name" : "file1",
      "Tsize" : 25
   },
   {
      "bytes" : {
         "/" : "B43512VB124G42B4YT14GV14TGF1422 ... "
      },
      "Name" : "file2",
      "Tsize" : 36
   }
]

Is the difference practical? Conceptual? It's the one thing I've not quite understood when reading the whole conversation

@rvagg
Member

rvagg commented Sep 5, 2022

@Winterhuman in your example you have what looks like a dag-pb conforming block printed in dag-json. One reason to put bytes in a CID is that dag-pb is so strict about what it can include: it only has one place for bytes, the Data field, and everything else should be a link inside the Links list. So if you were to try to pass your data through a dag-pb encoder it'll fail. But you could sneak those bytes in via an identity CID and have them resolve to what would essentially be a raw block that appears to be a separate entity.

But outside of dag-pb, the question still stands: why use an identity CID to smuggle bytes in if you could just include the bytes? I suppose it comes down to your desired schema; like with dag-pb, perhaps you need something to be a link in that position, but instead of linking to a tiny raw block, you just inline it and save some space. The same could be true of other data forms too: maybe your application's schema wants a link to a block containing a list, but you have an empty list; encoding that as a dag-cbor empty list would give a very small identity CID and you'd end up saving space doing so (even with the de-dupe you might get from needing to do it multiple times).
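
To make that last example concrete (byte values only, no multibase string): an empty dag-cbor list is the single CBOR byte 0x80, so the whole inline CID is five bytes,

0x01  CIDv1
0x71  dag-cbor codec
0x00  identity multihash code
0x01  digest length
0x80  the empty list itself

versus roughly 36 bytes for a sha2-256 link plus a separate block holding that same single byte.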

There are other cases too, like this one that's always slightly puzzled me: https://github.com/filecoin-project/specs-actors/blob/d8d9867f68a3c299295efdc6d1b3421c9b63df57/actors/builtin/codes.go - the identifiers for the builtin actors for Filecoin are identity CIDs containing the versioned name of the actor as bytes in a raw block. My guess is that we've always anticipated needing to identify actors by the hash of their code (wasm), so since we're going to need links in that position, let's just make them links from the start.

@rvagg
Member

rvagg commented Sep 5, 2022

This post extrapolates on one of the reasons why I think we should discourage the use of identity CIDs in Filecoin (so I can link to this from a lotus discussion). It's mainly relevant for online deal flow, and mainly influenced by the historical precedent we set by including inline CIDs as their own block entries in CARs, although there are other reasons why it's not necessarily a great idea.

When we include inline CIDs as separate blocks in CARs, the block data we're inlining shows up 3 times: in the original block with the link, then as the CID of a CAR section, and then as the data for that same CAR section. It'll show up another time in a CARv2 index, which will be generated for SPs actively serving retrievals, but I've not factored that in below.

We get de-duplication efficiencies if multiple blocks use the same inline CID but we'd get similar efficiencies if we were linking to the same block anyway, and beyond the size of a standard CID (~36 bytes) we rapidly end up losing any benefit.

Ignoring de-duplication effects, the "savings" can be visualised like this (only CIDv1), going up to the 126 bytes that Lotus will currently inline:

[Chart: storage cost of inlining data via an identity CID vs. linking to a separate block, for payloads up to 126 bytes]

Storage cost increases ~linearly for storing the bytes as a separate linked block, with the linking CID and the CID:bytes CAR entry; i.e. the overhead is ~2 CIDs. For identity CIDs the increase is ~3x, with an overhead of 9 bytes. Break-even is at 32 bytes.
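
One way to read those numbers back (treating CIDs as ~36 bytes and the identity CID's framing as ~3 bytes per copy, per the 9-byte overhead above; rough arithmetic only):

separate block:  N + 36 (link in the parent) + 36 (CAR section CID)  ≈ N + 72
identity CID:    3 × (N + 3)                                         ≈ 3N + 9
break-even:      3N + 9 = N + 72  →  N ≈ 32 bytes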

@rvagg
Member

rvagg commented Oct 12, 2022

Is anyone in this thread going to squeal if we put a hard limit in both js-multiformats and go-cid, and also a suggestion in the spec, that would error if they are asked to decode (or encode) an identity CID greater than 2048 bytes? Try to ignore for a moment that special-casing one specific hash function is a bit weird: would you actually care if this was done? If not, what alternative limit would make you happier?

@ribasushi

As a frequent user of 0x00 I would not mind, and this is the limit (2k) I originally proposed to @masih when they were doing the indexer inclusion thing. My only OCD request is to limit the payload ( the "hash" part ) to 2k instead of the entire MH/CID: i.e. make sure that no encoded varint gets to count towards these 2048 bytes.

@ianopolous
Member

Yep no problem. We've long since migrated off identity hashes for anything but small public keys, and we hard limit them to 36 bytes there.
