This repository has been archived by the owner on Jun 29, 2022. It is now read-only.

codec: AES encrypted blocks #349

Open · wants to merge 6 commits into base: master

Conversation

@mikeal (Contributor) commented Jan 13, 2021

Here’s an initial spec for the AES codec work I’ve done https://github.com/multiformats/js-multiformats/pull/59/files

@mikeal mikeal changed the title init: initial spec for AES codec init: initial spec for AES codecs Jan 13, 2021
@mikeal (Contributor, Author) commented Jan 13, 2021

There’s a discussion that started in the js-multiformats implementation that we should move to this spec multiformats/js-multiformats#59 (comment)

Should we drop the CID length and just parse the CID out of the blocking by parsing through the varints? It would require some additional parsing rules and complicate things but it would also shave 4 bytes off of every block.
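A rough sketch of what that extra parsing could look like. This is hypothetical, assuming a binary CIDv1 layout of varint(version) | varint(codec) | varint(hash-code) | varint(digest-length) | digest; none of these parsing rules are part of the spec as written.

```javascript
// Hypothetical: parse the CID out of the block without a length prefix
// by walking varints. Only CIDv1 is considered here.
const readVarint = (buf, offset) => {
  let n = 0
  let shift = 0
  for (;;) {
    const b = buf[offset]
    offset += 1
    n += (b & 0x7f) * 2 ** shift   // multiply instead of shift to stay safe past 32 bits
    if ((b & 0x80) === 0) return [n, offset]
    shift += 7
  }
}

// Returns [cidBytes, remainingBytes] with no stored length anywhere.
const splitCid = (buf) => {
  let value
  let offset = 0
  ;[value, offset] = readVarint(buf, offset)   // CID version
  if (value !== 1) throw new Error('this sketch only handles CIDv1')
  ;[, offset] = readVarint(buf, offset)        // content codec
  ;[, offset] = readVarint(buf, offset)        // multihash code
  ;[value, offset] = readVarint(buf, offset)   // digest length
  offset += value                              // skip the digest bytes
  return [buf.subarray(0, offset), buf.subarray(offset)]
}
```

It shows the tradeoff: four varint reads and a CIDv1-only restriction, in exchange for dropping the 4-byte length.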

@mikeal mikeal changed the title init: initial spec for AES codecs codec: AES encrypted blocks Jan 13, 2021
@rvagg (Member) commented Jan 14, 2021

Created multiformats/js-multiformats#60 to show what it could be like without the length.

@vmx (Member) left a comment

As this IPLD Codec doesn't support the full Data Model, I'd like to see a section about which parts it supports and how it is encoded.

It looks like it only supports two Kinds (when we don't do the length prefix as @rvagg suggests), Bytes and List (when we change the representation of the struct to tuple as I suggested in my other comment).

  • Bytes are just the bytes themselves, without any size prefix.
  • List is just its values concatenated, also without any size prefixing: neither for the number of elements, nor for the individual items.

type AES struct {
  iv Bytes
  bytes Bytes
} representation map
Member

I think using representation tuple would be clearer. I know that a codec can decide how to encode a Data Model Map, but just concatenating two byte streams reminds me conceptually more of an array/tuple than of a map. Same for the DecryptBlock.

Contributor Author

This describes what the parsed data model state is, not the encoded binary. It should be a map, not a tuple, because even without a schema being applied it’s a map.

@vmx (Member) Jan 14, 2021

I had to think about that quite a bit, now I get it. It makes sense. Then this codec only supports Bytes and Maps (so replace List with Map in my other comment).

@mikeal (Contributor, Author) commented Jan 14, 2021

There’s actually only bytes and a map that come out of the codec. The length is never surfaced; that’s just part of the block format.

@rvagg (Member) commented Jan 18, 2021

This codec only "supports" Bytes, nothing else, in the same way that dag-pb only supports Bytes and Links. The List, Map, etc. are only artifacts of how it decodes into the Data Model, they can't be used to encode any other forms.

@vmx (Member) commented Jan 18, 2021

Thanks @rvagg, I now get the point that this codec cannot encode any arbitrary Data Model Map. The use of Schemas here confused me at first, but pointing to DAG-PB made me realize that we do the exact same thing there too.

```
  bytes Bytes
} representation map
```

@aschmahmann Jan 19, 2021

Grabbing this line to start a thread in GitHub so as not to clog up the PR if people aren't interested. I know it's a bit long, but the Slack thread that generated this was much longer 😄.

I've mentioned this offline, but am adding my objections to these codecs here. I'm posting here instead of in the multicodec repo since, as @rvagg has mentioned a number of times before, in order to keep the code table open we shouldn't be gatekeepers on whether a code is "good enough" to get in the table. Instead we should just nudge people towards design patterns that are likely to work in the ecosystem and figure out what value the codes should get in the table (as well as prevent squatting lots of codes).

Summary

Overall, my issue with this proposal is that when people request nominative types in IPLD (or, failing that, in the code table) we have pushed back, and still do, by asserting that they should be handled at the application layer instead of the data layer, except here, where there's a particular built-in exception.

Specifics

  1. The codecs in this proposal do not meet the definition of codec in the spec

Codecs serve as an intermediary between raw bytes and the IPLD Data Model. They determine how data is converted to and from the Data Model.

aes-gcm and aes-ctr have the same serialized raw bytes and IPLD Data Model format, which means they can have the same codec. Giving additional codes for the same data format is something we regularly push back on.

This means we really only have two codecs and yet have defined three.

  2. Application types (unless my understanding is out of date/incorrect) are supposed to be defined on top of Data Model types, not serialization types. This means that any code reasonably expecting
type AES struct {
  iv Bytes
  bytes Bytes
} representation map

should be able to cope whether the data is serialized in the format described here, as CBOR, or in any other serialization format. However, to get the same behaviors that the privileged aes-* codecs get would require reserving another code in the code table (e.g. aes-gcm-cbor), which is both a large hurdle and would eat up table entries, making these the privileged codecs.

This spec even asserts that the application layer must be involved to get the encryption/decryption keys, so why not just use normal codecs and put determining whether it's aes-{gcm,cbc,ctr} in the application layer?

Alternatives

Implement a scheme based off of one of the proposals for nominative types in IPLD, such as #71.

For example, define the AES struct as:

type AES struct {
  @type optional Code
  iv Bytes
  bytes Bytes
} representation map

The type could be represented in any number of ways, but we can keep things similar to this proposal by defining aes-{gcm,cbc,ctr} in the code table as abstract definitions of ciphers (as opposed to this particular implementation).

If we thought it was worth saving a few bytes on the wire we could even make a custom codec (short-aes-encrypted-data): | varint(typeCodeSize) | typeCode | uint8(iv size) | iv | bytes | (we could even save a bit more by just asserting all typeCodes are 2 bytes).

So for the cost of <5 bytes per encrypted blob we now have replaceable formats and an example of how people who need nominative types can implement them.

If we think this case has such special requirements that it deserves to be special-cased then I guess that's the price being paid, but IMO we should at least be explicit about it. For example, are we claiming those bytes are just that valuable? Alternatively, do we think the out-of-band signaling is so especially useful even though the application layer is needed to figure out what keys to use anyhow?

Member

I really like the alternative. In regards to bytes overhead, I don't know enough about type codes, but looking at this PR it seems that sizes are fixed for a specific type code, so you would only need 1 byte per type code and infer the size from it (which isn't very self-describing, so I'm not saying we should do this, but I think we could if needed).


I don't want to derail the conversation too much further, but there were some lessons learned in this work: https://github.com/ipld/specs/blob/5b8f87ffa942f0ce30d53f799d49da1814f1273d/block-layer/codecs/dag-jose.md. Certainly, that is higher level than what is proposed here, but we might be able to borrow some concepts from https://tools.ietf.org/html/rfc7518, at least in terms of @type definitions?

@mikeal (Contributor, Author) Jan 19, 2021

The codecs in this proposal do not meet the definition of codec in the spec

We probably need to revise this language quite a bit. Our specs lean heavily toward describing how IPLD works with generic codecs that support the full Data Model. They do a poor job of describing codecs like this one, and codecs like bitcoin, git, etc.: codecs that have a Data Model representation but do not serialize arbitrary Data Model.

Member

Codecs that have a data model representation but do not serialize arbitrary data model.

I think it still holds true. Codecs still describe conversions to and from the Data Model, even if it isn't the full Data Model and doesn't support arbitrary things.

Contributor Author

We don’t have language that excludes these sorts of codecs, but we were just more concerned with full data model codecs when we wrote a lot of these specs and so the language is often biased in that direction.

```
} representation map
```

## Decrypted Block Format
@aschmahmann Jan 19, 2021

Maybe it's just me missing/misunderstanding something, but this section seems like it isn't related to the aes-* codec specs.

Is there a codec that turns

| uint32(CID byteLength) | CID | Bytes

into

type DecryptedBlock struct {
  cid Link
  bytes Bytes
} representation map

If so, what is it called and where is it used/referenced?

Is aes-gcm encoded data that properly satisfies | 12 byte IV | data bytes | deemed invalid if, when decrypted, the data does not match the serialized decrypted block format? If so then this gives us an IPLD format whose correctness is impossible to verify below the application layer, which stands outside the norms (although my understanding is the blockchain groups have been doing this, so it's not unheard of).
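For concreteness, a sketch of decoding the quoted decrypted block format into the schema's map form. The big-endian byte order of the uint32 prefix is an assumption here, not something the quoted lines specify.

```javascript
// Sketch: decode | uint32(CID byteLength) | CID | Bytes into { cid, bytes }.
// Assumption: the uint32 length prefix is big-endian (not stated above).
const decodeDecryptedBlock = (buf) => {
  const view = new DataView(buf.buffer, buf.byteOffset, buf.byteLength)
  const cidLength = view.getUint32(0)      // big-endian is DataView's default
  return {
    cid: buf.subarray(4, 4 + cidLength),   // raw CID bytes
    bytes: buf.subarray(4 + cidLength)     // the decrypted payload
  }
}
```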

Contributor Author

We’re in a bit of a grey area as to what “is” and “is not” part of the codec.

Sure, you could say that anything that falls outside the encode/decode function “is not” part of the codec. However:

  1. we expect encrypt/decrypt programs to ship with the codec
  2. the codec is intended to wrap an IPLD block (CID and Bytes) which is a consistent concept across IPLD

However, I DO NOT want to overly formalize the representation of the block because, ideally, what is input/output for encrypt/decrypt functions would align with the IPLD library in that language and map to how that library handles blocks, which is NOT formalized and there are big differences between implementations.


I get what you're saying, although I wonder if we should be disambiguating a bit in this doc by saying what implementers of the codec MUST do (e.g. follow the encoding/decoding rules for the encrypted block format) and what users of the codec SHOULD do (encrypt data of the form presented here, i.e. | CID | Bytes |).

@JonasKruckenberg

I wanted to give some feedback on this too, as it parallels a lot of the work we've been doing with dag-cose (and things that dag-jose already addresses kinda)

What I like about this proposal

I really like how simple this is; it's a nice low-level primitive to build more complex structures upon.
I also like that this does not add a lot of overhead (both processing- and storage-wise) to the block, something that worries me with dag-cose.

What I don't like about this proposal

My main concern is that this proposal is way too easy to misuse by developers who don't know better.
This codec offloads all of the security-relevant decisions to the application; while I get that there are good reasons for processing the block at the application layer, this also pushes ALL the responsibility to userland. So, in short, what does this codec offer that I can't already achieve with an identity codec?

My second concern is that this is basically reinventing the wheel; we already have battle-tested standards such as JOSE and COSE that cover the same area these codecs are covering.

That said, please don't feel offended; this proposal is definitely a step in the right direction, I just think this isn't our holy grail quite yet.
I think we're on to something with this proposal though!

@mikeal (Contributor, Author) commented Jan 19, 2021

This codec offloads all of the security-relevant decisions to the application

I don’t quite understand the concern here. What IPLD currently offers for encryption is nothing. Everyone doing encryption is doing it in the application layer above IPLD. What this spec does is offer very small primitives to help those projects along without changing the layer model of IPLD or forcing a particular encryption workflow on IPLD (which just wouldn’t work).

Just looking at where IPLD lives in the stack, it’s hard to imagine how we would add more than this.

we already have battle tested standards such as JOSE and COSE that cover the same area these codec are covering

These standards may seem small to you because you’re already using them, but for people who haven’t already fully adopted these standards they are quite large and contain a lot of opinions and other decisions that don’t make a lot of sense to other workflows.

I think those codecs will still be popular even while these ones occupy a similar space because those standards already have some adoption, but having spent time adding encryption to an IPLD application I can comfortably say that they are a lot more than is necessary and would be a barrier to adoption if they were the only way to do encryption in IPLD.

I also don’t think they are necessarily in conflict at all given the fact that these AES codecs don’t address signing whatsoever.

@JonasKruckenberg

Yeah I agree; as I've said in my talk, I also think that COSE is not a good fit for IPLD, for various reasons.
The overhead of COSE (and JOSE too, for that matter) is significant, and that's one of the reasons.
I've just seen a lot of people make a lot of poor security choices, either because they didn't know better or because they had to cut corners somewhere. This is something that really worries me and that we should keep in mind, that's all.

So anyway, I agree with you that low-level, small objects are a better fit for the composable nature of IPLD, +1 from me.

Maybe you could add a security guidelines section though: for example, never reuse keys, only use secure algorithms, etc.?

@mikeal (Contributor, Author) commented Jan 19, 2021

Maybe you could add a security guidelines section though: for example, never reuse keys, only use secure algorithms, etc.?

We captured some of this in exploration reports but we really need a larger and more accessible document on encryption workflows that can cover this sort of thing.

@mikeal (Contributor, Author) commented Jan 20, 2021

This is primarily in response to @aschmahmann but it’s a little broader than the scope of the thread it’s in so I’m doing a top level post about it.

In responding to another thread it became clear to me where I’m drawing the line between the codec identifier being a type identifier vs just a block format identifier.

Depending on your perspective, the entire multicodec table is a type system. Those “types” are tied directly to block formats which then normalize to a Data Model representation. However, it is clearly true that the codec identifier provides more than just a parser hint, and there are numerous examples where we use the codec identifier to provide additional type information beyond the data model representation. We do this w/ bitcoin, eth, git, etc. Those codecs mean a little more than “this is the block format”; they also signal what application produced those blocks, and that application will do additional typing on that block data beyond what IPLD does in the Data Model representation.

I don’t think it should be our goal to avoid multicodecs being used for type identification systems. But I do think it should be our goal to avoid multicodecs being used as the primary type identification system.

In other words, multicodecs should be used somewhat liberally to describe type systems rather than describing all the types within a system.

If Adin wants to write a new type system on IPLD, he should ask for one new multicodec. That should correspond to a block format that describes his types and produces a data model representation of that information while also acting as a signal that this data will mean more when handed over to Adin’s type system. That block format may literally just be dag-cbor, I don’t think it’s worth producing formal rules about format re-use.

Given these rules, I think the following spec changes are warranted.

  • A unified “Encrypted Block” specification for encoding blocks wrapped in encryption that include an initializing vector.
  • This new “Encrypted Block” format would contain within it a multicodec identifier for aes-gcm. AES flavors are common enough that they should be in the multicodec table anyway, and these identifiers can be re-used elsewhere in other cryptography systems for identifying the cipher. The spec would no longer be defining the meaning of the aes multicodecs if/when they are used as the codec in a CID. In other words, they won’t point specifically at this block format.

@Ericson2314

Here's the way I think of it:

  • If it's possible to have an "untyped" representation that uses fewer multicodecs, this is preferred.

The reason, of course, is that it is better to keep one-off parsing/validation logic out of IPFS and everything else using IPLD. However, if we do a reductio ad absurdum on that principle alone, we end up with there being no multicodecs (or rather, just 1), and we always get the raw bytes out. Clearly that is too extreme.

How can we fix this? I think with the following principle:

  • The multicodec should include enough information to recover all child links.

IPFS is about the graph structure, nothing more, nothing less. Raw bytes, per the above, give us no child links, and thus no graph structure. That is ugly, and in particular it rules out graphsync, GC and pinning, and all the things that make IPFS a step above BitTorrent and other similar antecedents.

Combine these two, and we get that multicodecs should expose just enough structure to allow recovering all child links, but no more. I think that is a good, tight constraint on the design space.

@mikeal (Contributor, Author) commented Jan 20, 2021

Big spec update to bring it in line with my last comment: collapsed into a single block format that describes the cipher and iv length in the block format.


```ipldsch
type AES struct {
  code Int
```


Perhaps I would've chosen to try experimenting with @type instead of code to see how it feels since it might encourage library development that takes nominative typing into account (e.g. #71 uses @type: codecNumber as an example).

However, I definitely understand just wanting to roll something out for encryption without worrying about longer term ramifications or bikeshedding on what @type could look like.

Contributor Author

code has been how we typically refer to multicodecs in a variety of locations. In this case, these all correspond to multicodecs, so it makes sense.

This representation ends up being kind of important in js-multiformats because it consistently destructures to the correct property name across the system.

Contributor

This seems very potäto/potato to me. I sincerely doubt making a field called "@type" instead of "code" here is going to influence anyone's hypothetical future library development in any clear direction.

Also: "@type" would not be syntactically valid IPLD Schema syntax: symbols are not permitted in field names. And the reason for that, in turn, is: suppose someone wants to feed this into code generators; that "@" isn't going to be a valid field name in most programming languages under the sun, either.

@aschmahmann Jan 20, 2021

I'm likely missing lots of context on js-multiformats, but

because it consistently destructures to the correct property name across the system

is the point I was getting at with @type: the more structures adopt a convention like this, the easier it becomes to reuse it in the future (e.g. if @type could be defined as a union of { Int | String | CID ... } similar to the proposal).

That block format may literally just be dag-cbor, I don’t think it’s worth producing formal rules about format re-use.

(from @mikeal's comment #349 (comment))

I disagree, I think separating the block layer from the application layer entirely is super useful. That means minimizing special semantics associated with codec names as much as possible such that they are only used to decode the bytes and get IPLD Data Model output.

Also: "@type" would not be syntactically valid IPLD Schema syntax: symbols are not permitted in field names. And the reason for that, in turn, is: suppose someone wants to feed this into code generators; that "@" isn't going to be a valid field name in most programming languages under the sun, either.

That's interesting, and it seems to push even more towards building libraries where a field like @type is considered special, since it's a valid Data Model field name but not a valid Schema name, and we've been pushing people towards describing their data with schemas. I.e. doing this in application space would be a pain since it's not valid schema syntax, and it is a generically useful feature -> do it below the application space?

btw I know we've gone at this a bunch already recently and while I'm happy to continue, consider my comment "However, I definitely understand just wanting to roll something out for encryption..." license to just say "I disagree, but we can go at this another time" 😄

Contributor Author

I’ll try to clarify.

In js-multiformats there are tools for defining and working with codecs, hashers, base encoders and now encryption/decryption programs.

In all cases the “lookup” for a given implementation is done with an integer code. Our implementations of CID and multihash include a code property that matches a code property on codec and hash implementations. Looking up a given implementation is easy and consistent since you can do const lookup = ({ code }) => { /* return implementation */ } and have it work consistently across all multiformats.

That’s why it’s so useful to have the data model representation use a code property because we can then lookup the decryption program in multiformats by just passing the return value from decode() directly to a lookup function that already works with CIDs and multihashes.

We aligned all of these in the last major refactor when we moved to integer codes, to avoid needing to ship every codec and base encoder, plus the multiformats table, just to support string names for codecs and hashers.
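The lookup pattern described above can be sketched as follows. This is illustrative only: the registry, the 0x1400 code, and the decrypt stub are hypothetical, not the actual js-multiformats API or real multicodec table entries.

```javascript
// Illustrative registry keyed by integer multicodec code, so codecs,
// hashers and decryption programs can all be looked up the same way.
const registry = new Map()

const register = (impl) => { registry.set(impl.code, impl) }

// Works for anything carrying an integer `code` property: a codec, a
// multihash, a CID, or the map returned by decode() on an encrypted block.
const lookup = ({ code }) => {
  const impl = registry.get(code)
  if (impl === undefined) {
    throw new Error(`no implementation registered for code 0x${code.toString(16)}`)
  }
  return impl
}

// 0x1400 is a placeholder code for illustration only.
register({ code: 0x1400, name: 'aes-gcm', decrypt: (block) => block /* stub */ })

// The decoded form of an encrypted block carries `code`, so it can be
// passed straight to lookup():
const found = lookup({ code: 0x1400 })
```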

@aschmahmann left a comment
I added some comments/questions that it'd be great to work through a bit, but overall I think this modified proposal has addressed my biggest concern (having the application layer extract application type information from the codec) and it looks good 👍. Thanks @mikeal

format is simply the iv followed by the byte payload.

```
| varint(cipher-multicodec) | varint(ivLength) | iv | bytes |
```


Not that any justification is needed to propose a new block format, but what is the context for using this over cbor?

Is it because we think encryption is so useful and common that shaving the bytes off the field name is worthwhile? Is it because we think this format is so much simpler to implement than cbor that it can be used across more of the ecosystem? Something else?

Contributor Author

The most obvious reason is that it’s substantially smaller than CBOR. There’s no reason for us to leverage large general purpose block formats when jamming a few buffers and/or varints together will work.

We should never assume that CBOR or DagCBOR are already available. These formats are not that widely used outside of our bubble and are nowhere near the adoption level that I’d consider we can “take for granted” like we might JSON.

When we leverage an existing codec we then also have to do schema validation on the input and output in order to ensure determinism since the generic codec will allow any number of generic properties to be there in addition to what we’ve defined in the spec.

If you can write a codec in a few lines of code for a new data structure that is already its own multicodec identifier we shouldn’t shy away from doing just that. We avoid all the caveats and concerns about determinism when we write self-describing types as an ordered concatenation of bytes.
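To make "a few lines of code" concrete, here is roughly what such a codec could look like. This is a sketch only, with varint helpers inlined and error handling omitted; the 0x1400 cipher code used in the usage note is a placeholder, not a real table entry.

```javascript
// Sketch of an encode/decode pair for the proposed block format:
// | varint(cipher-multicodec) | varint(ivLength) | iv | bytes |
// Unsigned LEB128 varint helpers, inlined to keep this self-contained;
// they only handle values below 2^31, which is plenty for multicodec codes.
const encodeVarint = (n) => {
  const out = []
  while (n > 127) { out.push((n & 0x7f) | 0x80); n >>>= 7 }
  out.push(n)
  return out
}

const decodeVarint = (buf, offset) => {
  let n = 0
  let shift = 0
  for (;;) {
    const b = buf[offset]
    offset += 1
    n |= (b & 0x7f) << shift
    if ((b & 0x80) === 0) return [n, offset]
    shift += 7
  }
}

const encode = ({ code, iv, bytes }) =>
  new Uint8Array([...encodeVarint(code), ...encodeVarint(iv.length), ...iv, ...bytes])

const decode = (buf) => {
  let code, ivLength, offset
  ;[code, offset] = decodeVarint(buf, 0)
  ;[ivLength, offset] = decodeVarint(buf, offset)
  return {
    code,                                        // cipher multicodec
    iv: buf.subarray(offset, offset + ivLength),
    bytes: buf.subarray(offset + ivLength)       // the ciphertext payload
  }
}
```

Because the format is a strict ordered concatenation, a round-trip through encode/decode is deterministic by construction: there are no optional fields or map orderings to validate.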


yep, I figured 👍. The one caveat I have is:

If you can write a codec in a few lines of code for a new data structure that is already its own multicodec identifier we shouldn’t shy away from doing just that.

I think we should still shy away from it, since I'd prefer the number of codecs to be smaller rather than bigger: it puts pressure on generalized systems like IPFS to get bigger and bigger as they need to add support for more and more codecs. Codecs also take up slots in the code table, and arguably any block format could want a small code number, because people could be storing tons of blocks and could then easily save some space.

@warpfork (Contributor) Jan 20, 2021

I think this is a good question. "why would we prefer x over cbor?" is almost always a valid question (so much so that we should almost put it in a checklist for proposals!), and this context is no different.

In this case: I could see users remarking on the bytes shaved off as useful, yes. And the simplicity also does seem potentially relevant: that this format will never, ever need to be capable of any kind of recursion, since it's just a small wrapper for the ciphertext, does make it possible to implement with considerably simpler mechanisms than cbor.

I don't think either of those arguments is open-and-shut, but they're enough to make me receptive to a new block format, at least.

```diff
@@ -47,25 +48,29 @@ payload. Since the initializing vector is a known size for each AES variant the
 format is simply the iv followed by the byte payload.

-| iv | bytes |
+| varint(cipher-multicodec) | varint(ivLength) | iv | bytes |
```
Member

There is an argument for using a custom block format because of its size compared to e.g. DAG-CBOR.

The current proposal is | varint(cipher-multicodec) | varint(ivLength) | iv | bytes |; with the currently chosen multicodecs that's 3 bytes of overhead compared to the actual data.

As a custom format needs custom code anyway, you could just hard-code the lengths of the iv based on the cipher and use an enum (and not a multicodec) for the cipher type, so you end up with 1 byte of overhead.

If you used DAG-CBOR with an enum (and not multicodecs) for the ciphers and a list representation, the overhead would be 7 bytes.

I'd go for either the 1 byte overhead or something codec independent (e.g. DAG-CBOR), but I don't have a strong opinion.

@aschmahmann Jan 21, 2021

An interesting point about needing custom code anyway, although it depends on where in the stack this custom code lives. If a new cipher is added you'd now have to change the codec code, which is a little awkward.

For example, suppose we added support for this codec in go-ipfs but had not added support for a new cipher. With the extra bytes, some python or js client could happily fetch the data from go-ipfs and decrypt with their custom cipher without any problems; by removing those bytes, the user now needs to send a PR to go-ipfs to modify the codec to support their cipher.

```
| varint(cipher-multicodec) | varint(ivLength) | iv | bytes |
```

This is decoded into the IPLD Data Model using the following schema.
Member

Can you put a title in here, or just inline wording that says this is the "Logical Format"? We use that terminology in the DAG-PB spec to make it distinct from the wire format, and I think it's a really good framing for these schemas that talk about how we instantiate data forms out of a soup of bytes that could be interpreted in any number of ways.

```
| CID | Bytes |
```

The decrypted state is decoded into the IPLD Data Model using the following schema.
Member

same comment re "Logical Format"

@rvagg

rvagg commented Jan 29, 2021

Some questions I have for crypto-heads:

  • Should we exclude CBC from this proposal entirely? I could imagine an IPLD-based communication system where you could construct a padding oracle attack, but maybe that's not a reasonable concern.
  • Adding CCM is nice because it shows the extensibility of the system from the start, but would anyone have a reason to choose CCM, and should we therefore just exclude it initially to encourage people to use GCM?
  • ChaCha20 seems like a good addition for an initial suite; is that reasonable?
  • Do we have any reason to consider authentication tags, AAD, or even full AEAD in this situation? We're not trying to build TLS with this, but some of it feels like it's coming close to needing additional features, especially depending on how someone might deploy such a system and what they use it for.

As per multiformats/multicodec#202 (comment) I've also proposed that we add the key length to the AES cipher entries in the multicodec table, so you'd choose, for example, aes-256-gcm.

@rvagg

rvagg commented Feb 1, 2021

OK, I had a brief discussion with @nikkolasg about this and did some more thinking and researching and here's my current position:

  • Let's leave off the auth tag, additional data, and AEAD concerns for now; this format should be enough to capture the basic case. We get authentication, to some degree, from the data just being encapsulated in an IPLD block with a content address. But there are going to be use cases where things start to break down, and we might want to add an additional format later that adds some of these authentication features. This may also be a documentation problem for us: don't expect this to solve all your problems and work for all your use cases; know the limits of what this offers!
  • Let's remove CBC. It may not be a problem in general for how we mostly expect this to work, but since it's known to be flawed in the typical environments it's used in, it has a stain that follows it around, and we can just step over that by not even touching it.
  • CCM is... OK... I just don't think there's a point: if you're choosing AES then you're choosing GCM at the moment (maybe we'll be talking about AEX in a few years, but that's not yet).
  • Adding ChaCha20 would be good. It's typically coupled with AES in cipher suites because it's faster where you don't have hardware implementations, and it's treated as a "we have this other, entirely different thing in our suite in case AES is found to be insecure one day, so we're not scrambling for a replacement through our whole deployed stack".
  • XChaCha20 is worth considering as an addition, or instead of ChaCha20. It uses bigger nonces and is supposed to be more secure across larger numbers of messages with the same key for that reason.
  • We should dictate that implementations support both the AES and ChaCha20-based ciphers, for the reason stated above.

Mostly though, I think the format is fine for now, we can add a new multicodec for an extended-encryption if we need it later.

@warpfork

What happened here? Are these spec changes we should merge as specs, or things we should keep in exploration-report territory until they're further ratified and have more implementations? Who's working on it?

I'd love to land some of the data here, whether it's as fully-finished-and-ratified specs, or architecture design records, or exploration reports, I don't really care, I just want to get some more stuff out of the "open PRs" lane :)

@rvagg

rvagg commented Apr 21, 2021

Stalled @ multiformats/js-multiformats#59 but super close. I know Textile are interested in trying to use this so we could push it over the line. Project proposal @ protocol/web3-dev-team#49 to get it wrapped up and my estimation is that it's fairly low investment to do.
