-
Notifications
You must be signed in to change notification settings - Fork 108
codec: AES encrypted blocks #349
base: master
Are you sure you want to change the base?
Conversation
There’s a discussion that started in the Should we drop the CID length and just parse the CID out of the blocking by parsing through the varints? It would require some additional parsing rules and complicate things but it would also shave 4 bytes off of every block. |
Created multiformats/js-multiformats#60 to show what it could be like without the length. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As this IPLD Codec doesn't support the full Data Model, I'd like to see a section about which parts it supports and how it is encoded.
It looks like it only supports two Kinds (when we don't do the length prefix as @rvagg suggests), Bytes
and List
(when we change the representation of the struct to tuple
as I suggested in my other comment).
Bytes
a just the bytes themselves without any size prefix.List
is just its values concatenated also without any size prefixing, neither for the number of elements, nor for the individual items.
block-layer/codecs/aes.md
Outdated
type AES struct { | ||
iv Bytes | ||
bytes Bytes | ||
} representation map |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think using representation tuple
would be clearer. I know that a codec can decide how to encode a Data Model Map, but just concatenating two byte streams reminds me conceptually more of a an array/tuple than of a map. Same for the DecryptBlock
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This describes what the parsed data model state is, not the encoded binary. It should be a map, not a tuple, because even without a schema being applied it’s a map.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had to think about that quite a bit, now I get it. It makes sense. Then this codec only supports Bytes
and Map
s (so replace List
with Map
in my other comment).
There’s actually only byes and a map that come out of the codec. The length is never surfaced, that’s just part of the block format. |
This codec only "supports" |
Thanks @rvagg, I know get the point that this codec cannot encode any arbitrary Data Model Map. The use of Schemas here confused me at first, but pointing to DAG-PB made me realize that we do the exact same thing there too. |
block-layer/codecs/aes.md
Outdated
bytes Bytes | ||
} representation map | ||
``` | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Grabbing this line to start a thread in GitHub so as not to clob up the PR if people aren't interested. I know it's a bit long, but the Slack thread that generated this was much longer 😄.
I've mentioned this offline, but am adding my objections to these codecs here. I'm posting here instead of in the multicodec repo since, as @rvagg has mentioned a number of times before, in order to keep the code table open we shouldn't be gatekeepers on whether a code is "good enough" to get in the table. Instead we should just nudge people towards design patterns that are likely to work in the ecosystem and figure out what value the codes should get in the table (as well as prevent squatting lots of codes).
Summary
Overall, my issues with this proposal are that when people request nominative types in IPLD (or failing that in the code table) we have (and do) push back by asserting that it should be handled at the application layer instead of the data layer, except for here where there's a particular built-in exception.
Specifics
- The codecs in this proposal do not meet the definition of codec in the spec
Codecs serve as an intermediary between raw bytes and the IPLD Data Model. They determine how data is converted to and from the Data Model.
aes-gcm and aes-ctr have the same serialized raw bytes and IPLD Data Model format, which means they can have the same codec. Giving additional codes for the same data format is something we regularly push back on.
This means we really only have two codecs and yet have defined three.
- Application types (unless my understanding is out of date/incorrect) are supposed to be defined on top of Data Model types not serialization types. This means that any code reasonably expecting
type AES struct {
iv Bytes
bytes Bytes
} representation map
should be able to cope whether the data is serialized in the format described here, as CBOR, or any other serialization format. However, to reach the same behaviors that the priviledged aes-* codecs get would require reserving another code in the code table (e.g. aes-gcm-cbor) which is both a large hurdle and would eat up table entries making these the priviledged codecs.
This spec even asserts that the application layer must be involved to get the encryption/decryption keys so why not just use normal codecs and put determining if it's aes-{gcm,cbc,ctr} in the application layer?
Alternatives
Implement a scheme based off of one of the proposals for nominative types in IPLD, such as #71.
For example, define the AES struct as:
type AES struct {
@type optional Code
iv Bytes
bytes Bytes
} representation map
While the type could be represented in any number of ways, but we can keep things similar to this proposal by defining aes-{gcm,cbc,ctr} in the code table as abstract definitions of ciphers (as opposed to this particular implementation).
If we thought it was worth saving a few bytes on the wire we could even make a custom codec (short-aes-encrypted-data) varint(typeCodeSize)| typeCode | uint8(iv size) | iv | bytes
(we could even save a bit more by just asserting all typeCodes are 2 bytes).
So for the cost of <5 bytes per encrypted blob we now have replaceable formats and an example of how people who need nominative types can implement them.
If we think this case has such special requirements that it deserves to be special cased then I guess that's the price being paid, but IMO we should at least be explicit about it. For example are we're claiming those bytes are just that valuable? Alternatively, do we think the out-of-band signaling is so specially useful even though the application layer is needed to figure out what keys to use anyhow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like the alternative. In regards to bytes overhead, I don't know enough about type codes, but looking at this PR it seems that sizes are fixed for a specific type code, so you would only need 1 byte per type code and infer the size from it (which isn't very self-describing, so I'm not saying we should do this, but I think we could if needed).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I won't want to derail the conversation too much further, but there were some lessons learned in this work: https://github.com/ipld/specs/blob/5b8f87ffa942f0ce30d53f799d49da1814f1273d/block-layer/codecs/dag-jose.md. Certainly, that is higher level than what is proposed here, but we might be able to borrow some concepts from: https://tools.ietf.org/html/rfc7518, at least in terms of @type definitions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The codecs in this proposal do not meet the definition of codec in the spec
We probably need to revise this language quite a bit. Our specs lean heavily toward describing how IPLD works with generic codecs that support the full data model. It does a poor job of describing codecs like this and codecs like bitcoin, git, etc. Codecs that have a data model representation but do not serialize arbitrary data model.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Codecs that have a data model representation but do not serialize arbitrary data model.
I think it still hold true. Codecs still describe conversions from and to the Data Model, even if it isn't the full Data Model and doesn't even support arbitrary things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don’t have language that excludes these sorts of codecs, but we were just more concerned with full data model codecs when we wrote a lot of these specs and so the language is often biased in that direction.
block-layer/codecs/aes.md
Outdated
} representation map | ||
``` | ||
|
||
## Decrypted Block Format |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it's just me missing/misunderstanding something, but this section seems like it isn't related to the aes-* codec specs.
Is there a codec that turns
| uint32(CID byteLength) | CID | Bytes
into
type DecryptedBlock struct {
cid Link
bytes Bytes
} representation map
If so, what is it called and where is it used/referenced?
Is aes-gcm encoded data that properly satisfies 12 byte IV | data bytes
deemed invalid data if when decrypted the data does not match the serialized decrypted block format? If so then this gives us an IPLD format whose correctness is impossible to verify below the application layer which stands outside the norms (although my understanding is the blockchain groups have been doing this so it's not unheard of).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We’re in a bit of a grey area as to what “is” and “is not” part of the codec.
Sure, you could say that anything that falls outside the encode/decode function “is not” part of the codec. However:
- we expect encrypt/decrypt programs to ship with the codec
- the codec is intended to wrap an IPLD block (CID and Bytes) which is a consistent concept across IPLD
However, I DO NOT want to overly formalize the representation of the block because, ideally, what is input/output for encrypt/decrypt functions would align with the IPLD library in that language and map to how that library handles blocks, which is NOT formalized and there are big differences between implementations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get what you're saying, although I wonder if we should be disambiguating a bit in this doc by saying what implementers of the codec MUST do (e.g. follow the encoding/decoding rules for the encrypted block format) and what users of the codec SHOULD do (encrypt data of the form presented here, i.e. | CID | Bytes |
).
I wanted to give some feedback on this too, as it parallels a lot of the work we've been doing with dag-cose (and things that dag-jose already addresses kinda) What I like about this proposalI really like how simple this is, it's a nice low level primitive to built more complex structures upon. What I don't like about this proposalMy main concern is, that this proposal is way too easy to misuse by developers who don't know better. My seconds concern is that this is basically reinventing the wheel, we already have battle tested standards such as JOSE and COSE that cover the same area these codec are covering. That said, please don't feel offended, this proposal is definitely a step in the right direction, I just think this isn't our holy grail quite yet. |
I don’t quite understand the concern here. What IPLD currently offers for encryption is nothing. Everyone doing encryption is doing it in the application layer above IPLD. What this spec does is offer very small primitives to help those projects along without changing the layer model of IPLD or forcing a particular encryption workflow on IPLD (which just wouldn’t work). Just looking at where IPLD lives in the stack, it’s hard to imagine how we would add more than this.
These standards may seem small to you because you’re already using them, but for people who haven’t already fully adopted these standards they are quite large and contain a lot of opinions and other decisions that don’t make a lot of sense to other workflows. I think those codecs will still be popular even while these ones occupy a similar space because those standards already have some adoption, but having spent time adding encryption to an IPLD application I can comfortably say that they are a lot more than is necessary and would be a barrier to adoption if they were the only way to do encryption in IPLD. I also don’t think they are necessarily in conflict at all given the fact that these AES codecs don’t address signing whatsoever. |
Yeah I agree, as I've said in my talk I also think that COSE is not a good fit for IPLD, for various reasons. So anyway, I agree with you that low-level, small objects are a better fit for the composable nature of IPLD, +1 from me. Maybe you can add a security guidelines section though, for example never reuse keys, only use secure algorithms etc.? |
We captured some of this in exploration reports but we really need a larger and more accessible document on encryption workflows that can cover this sort of thing. |
This is primarily in response to @aschmahmann but it’s a little broader than the scope of the thread it’s in so I’m doing a top level post about it. In responding to another thread it became clear to me where I’m drawing the line between the codec identifier being a type identifier vs just a block format identifier. Depending on your perspective the entire multicodec table is a type system. Those “types” are tied directly to block formats which then normalize to a Data Model representation. However, it is clearly true that the codec identifier is providing more than just a parser hint and there are numerous examples where we use the codec identifier to provide additional type information beyond the data model representation. We do this w/ bitcoin, eth, git, etc. Those codecs mean a little more than “this is the block format”, they also signal what application produced those blocks and that application will do additional typing on that block data than IPLD will do in just the Data Model representation. I don’t think it should be our goal to avoid muticodecs being used for type identification systems. But I do think it should be our goal to avoid multicodecs being used as the primary type identification system. In other words, multicodecs should be used somewhat liberally to describe type systems rather than describing all the types within a system. If Adin wants to write a new type system on IPLD, he should ask for one new multicodec. That should correspond to a block format that describes his types and produces a data model representation of that information while also acting as a signal that this data will mean more when handed over to Adin’s type system. That block format may literally just be dag-cbor, I don’t think it’s worth producing formal rules about format re-use. Given these rules, I think the following spec changes are warranted.
|
Here's the way I think of it:
The reasons are of course that is better to keep one-off parsing/validation logic out of IPFS and everything else using IPLD. However, if we do a reductio ad absurdum on that principle alone, we end up with there should be no multicodecs (or rather, just 1), and we always get the raw bytes out. Clearly that is too extreme. How can we fix this? I think with the following principle:
IPFS about the graph structure, nothing more, nothing less. Raw bytes per the above give us no child links, and thus no graph structure. This is ugly, and in particular it rules out graphsync, GC and pinning, and all the things that make IPFS a step above BitTorrent and other similar antecedents. Combine these two, and we get that multicodecs should expose just enough structure to allow recovering all child links, but no more, and I think that is a good tight constraint on the design space. |
Big spec update to bring it inline with my last comment. Collapsed into a single block format that describes the cipher and iv length in the block format. |
|
||
```ipldsch | ||
type AES struct { | ||
code Int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps I would've chosen to try experimenting with @type
instead of code
to see how it feels since it might encourage library development that takes nominative typing into account (e.g. #71 uses @type: codecNumber
as an example).
However, I definitely understand just wanting to roll something out for encryption without worrying about longer term ramifications or bikeshedding on what @type
could look like.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
code
has been how we typically refer to multicodecs in a variety of locations. in this case, these all correspond to multicodecs so it makes sense.
this representation ends up being kind of important in js-multiformats
because it consistently destructures to the correct property name across the system
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems very potäto/potato to me. I sincerely doubt making a field called "@type" instead of "code" here is going to influence anyone's hypothetical future library development in any clear direction.
Also: "@type" would not be syntactically valid IPLD Schema syntax: symbols are not permitted in field names. And for this, the reason in turn is: suppose someone wants to feed this into codegenerators: that "@" isn't going to be a valid field name in most programming languages under the sun, either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm likely missing lots of context on js-multiformats
but
because it consistently destructures to the correct property name across the system
is the point I was getting at with @type
, the more structures adopt a convention like that this the easier it becomes to reuse it in the future. (e.g. if @type
defined could be defined as a union of { Int | String | CID ... } similar to the proposal ).
That block format may literally just be dag-cbor, I don’t think it’s worth producing formal rules about format re-use.
(from @mikeal's comment #349 (comment))
I disagree, I think separating the block layer from the application layer entirely is super useful. That means minimizing special semantics associated with codec names as much as possible such that they are only used to decode the bytes and get IPLD Data Model output.
Also:
@type
would not be syntactically valid IPLD Schema syntax: symbols are not permitted in field names. And for this, the reason in turn is: suppose someone wants to feed this into codegenerators: that "@" isn't going to be a valid field name in most programming languages under the sun, either.
That's interesting and seems to push even more towards building libraries where a field like @type
is considered special since it's a valid data model field name but not a valid schema name, and we've been pushing people towards describing their data with schemas. i.e. doing this in application space would be a pain since it's not valid schema syntax and it is a generically useful feature -> do it below the application space?
btw I know we've gone at this a bunch already recently and while I'm happy to continue, consider my comment "However, I definitely understand just wanting to roll something out for encryption..." license to just say "I disagree, but we can go at this another time" 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’ll try to clarify.
In js-multiformats
there are tools for defining and working with codecs, hashers, base encoders and now encryption/decryption programs.
In all cases the “lookup” for a given implementation is done with an integer code. Our implementations of CID and multihash include a code
property that matches a code
property on codec and hash implementations. Looking up a given implementation is easy and consistent since you can do const lookup = ({ code }) => { /* return implementation */ }
and have it work consistently across all multiformats.
That’s why it’s so useful to have the data model representation use a code
property because we can then lookup the decryption program in multiformats by just passing the return value from decode()
directly to a lookup function that already works with CIDs and multihashes.
We aligned all of these in the last major refactor when we moved to integer codes in order to avoid needing to ship every codec and base encoder, and the multiformats table, in order to support string names for codecs and hashers.
block-layer/codecs/aes.md
Outdated
} representation map | ||
``` | ||
|
||
## Decrypted Block Format |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get what you're saying, although I wonder if we should be disambiguating a bit in this doc by saying what implementers of the codec MUST do (e.g. follow the encoding/decoding rules for the encrypted block format) and what users of the codec SHOULD do (encrypt data of the form presented here, i.e. | CID | Bytes |
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added some comments/questions it'd be great to work through a bit, but overall I think this modified proposal has addressed my biggest concern (having the application layer extract application type information from the codec) and it looks good 👍. Thanks @mikeal
format is simply the iv followed by the byte payload. | ||
|
||
``` | ||
| varint(cipher-multicodec) | varint(ivLength) | iv | bytes | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not that any justification is needed to propose a new block format, but what is the context for using this over cbor?
Is it because we think encryption is so useful and common that shaving the bytes off the field name is worthwhile? Is it because we think this format is so much simpler to implement than cbor that it can be used across more of the ecosystem? Something else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The most obvious reason is that it’s substantially smaller than CBOR. There’s no reason for us to leverage large general purpose block formats when jamming a few buffers and/or varints together will work.
We should never assume that CBOR or DagCBOR are already available. These formats are not that widely used outside of our bubble and are nowhere near the adoption level that I’d consider we can “take for granted” like we might JSON.
When we leverage an existing codec we then also have to do schema validation on the input and output in order to ensure determinism since the generic codec will allow any number of generic properties to be there in addition to what we’ve defined in the spec.
If you can write a codec in a few lines of code for a new data structure that is already its own multicodec identifier we shouldn’t shy away from doing just that. We avoid all the caveats and concerns about determinism when we write self-describing types as an ordered concatenation of bytes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, I figured 👍. The one caveat I have is:
If you can write a codec in a few lines of code for a new data structure that is already its own multicodec identifier we shouldn’t shy away from doing just that.
I think we should still shy away from it since I'd prefer the number of codecs to be smaller rather than bigger as it puts pressure on generalized systems like IPFS to get bigger and bigger as they need to add support for more and more codecs. They also take up slots in the code table and arguably any block format could want a small code number because people could be storing tons of blocks and could then easily save some space.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a good question. "why would we prefer x over cbor?" is almost always a valid question (so much so that we should almost put it in a checklist for proposals!), and this context is no different.
In this case: I could see users remarking on the bytes shaved off as useful, yes. And the simplicity also does seem potentially relevant: that this format will never, ever need to be capable of any kind of recursion, since it's just a small wrapper for the ciphertext, does make it possible to implement with considerably simpler mechanisms than cbor.
I don't think either of those arguments is open-and-shut, but they're enough to make me receptive to a new block format, at least.
@@ -47,25 +48,29 @@ payload. Since the initializing vector is a known size for each AES variant the | |||
format is simply the iv followed by the byte payload. | |||
|
|||
``` | |||
| iv | bytes | | |||
| varint(cipher-multicodec) | varint(ivLength) | iv | bytes | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is an argument about using a custom block format because of it's size compared to e.g. DAG-CBOR.
The current proposal is | varint(cipher-multicodec) | varint(ivLength) | iv | bytes |
with the currently choosen multicodecs that's 3 bytes overhead compared to the actual data.
As a custom format needs custom code anyway, you could just hard-code the lengths of the iv
based on the cipher and use an enum (and not a multicodec for the cipher type), so you end up with 1 byte overhead.
If you would use DAG-CBOR, when using an enum (and not multicodecs) for the ciphers and using list representation the overhead would be 7 bytes.
I'd go for either the 1 byte overhead or codec independent (e.g. DAG-CBOR), but I don't have a strong opinion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An interesting point about needing custom code anyway. Although it depends on where in the stack this custom code lives. If a new cipher is added you'd now have to change the codec code which is a little awkward.
For example, if we added support for this codec in go-ipfs but had not added support for a new cipher. With the extra bytes some python or js client could happily fetch the data from go-ipfs and decrypt with their custom cipher without any problems, by removing those bytes now the user needs to send a PR to go-ipfs to modify the codec to support their cipher.
| varint(cipher-multicodec) | varint(ivLength) | iv | bytes | | ||
``` | ||
|
||
This is decoded into IPLD data model of the following schema. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you put a title in here, or a just inline the wording that says that this is the "Logical Format"? We use that terminology in the DAG-PB spec to make it distinct from the wire format and I think it's a really good framing for these schemas that talk about how we instantiate data forms out of a soup of bytes that could be interpreted in any number of ways.
| CID | Bytes | ||
``` | ||
|
||
The decrypted state is decoded into IPLD data model of the following schema. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment re "Logical Format"
Some questions I have for crypto-heads:
As per multiformats/multicodec#202 (comment) I've also proposed that we add keylength to the AES cipher entries in the multicodec table, so you'd choose |
OK, I had a brief discussion with @nikkolasg about this and did some more thinking and researching and here's my current position:
Mostly though, I think the format is fine for now, we can add a new multicodec for an extended-encryption if we need it later. |
What happened here? Are these spec changes we should merge as specs, or are they things we should keep in exploration report territory until further ratified and have more implementations? Who's working on it? I'd love to land some of the data here, whether it's as fully-finished-and-ratified specs, or architecture design records, or exploration reports, I don't really care, I just want to get some more stuff out of the "open PRs" lane :) |
Stalled @ multiformats/js-multiformats#59 but super close. I know Textile are interested in trying to use this so we could push it over the line. Project proposal @ protocol/web3-dev-team#49 to get it wrapped up and my estimation is that it's fairly low investment to do. |
Here’s an initial spec for the AES codec work I’ve done https://github.com/multiformats/js-multiformats/pull/59/files