codec: AES encrypted blocks #349
base: master
Changes from 5 commits
2f07383
4a60e40
03d54df
417e69c
bbb6014
bf8e5e0
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,79 @@ | ||
# Specification: AES CODECS | ||
|
||
**Status: Prescriptive - Draft** | ||
|
||
This document describes codecs for IPLD blocks (CID + Data) that are encrypted with | ||
an AES cipher. | ||
|
||
The following AES variants are defined in this spec: | ||
|
||
| name | multicodec | iv size (in bytes) | | ||
| --- | --- | --- | | ||
| aes-gcm | 0x1400 | 12 | | ||
| aes-cbc | 0x1401 | 16 | | ||
| aes-ctr | 0x1402 | 12 | | ||
|
||
## What this spec is not | ||
|
||
This is not a complete system for application privacy. The following issues are | ||
out of scope for this specification, although they can obviously leverage these codecs: | ||
|
||
* Key signaling | ||
* Access controls | ||
* Dual-layer encryption w/ replication keys | ||
|
||
How you determine what key to apply to an encrypted block will need to be done in the | ||
application layer. How you decide to encrypt a graph and potentially link the encrypted | ||
blocks together for replication is done at the application layer. How you manage and access | ||
keys is done in the application layer. | ||
|
||
## Encode/Decode vs Encrypt/Decrypt | ||
|
||
The goal of specifying codecs that are used for encryption is to allow the codecs to | ||
include encryption and decryption programs alongside the codec. However, encrypting and | ||
decrypting are done by the user and are not done automatically as part of any encode/decode | ||
operation in the codec. | ||
|
||
The encryption program returns a data model value suitable for the block encode program. The | ||
decode program provides a data model value suitable for the decryption program. And the decryption | ||
program provides a data model value suitable for parsing into a new block (CID and Bytes). These | ||
programs are designed to interoperate but it's up to the user to combine them and provide the | ||
necessary key during encryption and decryption. | ||
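The separation described above can be sketched as follows. This is a minimal illustration in Python, not the reference implementation: the function names, the `CipherFn` callback type, and the use of the aes-gcm iv size are assumptions made for the example, and the AES primitive itself is left to whatever library the user supplies. Only `encrypt` and `decrypt` take a key; `encode` and `decode` never see one.

```python
from typing import Callable, Dict

# Hypothetical signature for a user-supplied AES primitive:
# (key, iv, data) -> transformed data. This type is not defined by the spec.
CipherFn = Callable[[bytes, bytes, bytes], bytes]

# iv size in bytes for aes-gcm, taken from the table in this spec.
IV_SIZE = 12


def encrypt(key: bytes, iv: bytes, payload: bytes, aes_encrypt: CipherFn) -> Dict[str, bytes]:
    """User step: requires the key; returns a data model value for encode()."""
    return {"iv": iv, "bytes": aes_encrypt(key, iv, payload)}


def encode(value: Dict[str, bytes]) -> bytes:
    """Codec step: no key involved; writes the iv followed by the payload."""
    return value["iv"] + value["bytes"]


def decode(block: bytes) -> Dict[str, bytes]:
    """Codec step: no key involved; splits the block at the fixed iv size."""
    return {"iv": block[:IV_SIZE], "bytes": block[IV_SIZE:]}


def decrypt(key: bytes, value: Dict[str, bytes], aes_decrypt: CipherFn) -> bytes:
    """User step: requires the key; returns the decrypted payload bytes."""
    return aes_decrypt(key, value["iv"], value["bytes"])
```

A caller would chain these as `encode(encrypt(...))` when writing and `decrypt(key, decode(block), ...)` when reading. The codec can therefore decode a block it holds no key for.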
|
||
## Encrypted Block Format | ||
|
||
An encrypted block can be decoded into its initialization vector (iv) and the encrypted byte | ||
payload. Since the iv has a known, fixed size for each AES variant, the block | ||
format is simply the iv followed by the byte payload. | ||
|
||
``` | ||
| iv | bytes | | ||
``` | ||
|
||
This is decoded into the IPLD data model according to the following schema. | ||
|
||
```ipldsch | ||
type AES struct { | ||
iv Bytes | ||
bytes Bytes | ||
} representation map | ||
``` | ||
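As a rough sketch of how the encrypted block format might be encoded and decoded, assuming the iv sizes from the table above: the function names and the choice to raise `ValueError` on malformed input are illustrative assumptions, not part of the spec.

```python
from typing import Dict

# iv sizes in bytes, copied from the table of AES variants in this spec.
IV_SIZES = {"aes-gcm": 12, "aes-cbc": 16, "aes-ctr": 12}


def encode_encrypted(codec: str, value: Dict[str, bytes]) -> bytes:
    """Serialize an {iv, bytes} map as the iv followed by the payload."""
    iv, payload = value["iv"], value["bytes"]
    if len(iv) != IV_SIZES[codec]:
        raise ValueError(f"{codec} expects a {IV_SIZES[codec]}-byte iv, got {len(iv)}")
    return iv + payload


def decode_encrypted(codec: str, block: bytes) -> Dict[str, bytes]:
    """Split a block into its iv and payload using the codec's fixed iv size."""
    size = IV_SIZES[codec]
    if len(block) < size:
        raise ValueError(f"block is shorter than the {size}-byte iv for {codec}")
    return {"iv": block[:size], "bytes": block[size:]}
```

The decoded value is a map with the keys `iv` and `bytes`, matching the schema above. The binary form carries no field names or length prefix because the iv size is implied by the codec.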
|
||
## Decrypted Block Format | ||
Maybe it's just me missing/misunderstanding something, but this section seems like it isn't related to the aes-* codec specs. Is there a codec that turns
into
If so, what is it called and where is it used/referenced? Is aes-gcm encoded data that properly satisfies

We’re in a bit of a grey area as to what “is” and “is not” part of the codec. Sure, you could say that anything that falls outside the encode/decode function “is not” part of the codec. However:

However, I DO NOT want to overly formalize the representation of the block because, ideally, what is input/output for encrypt/decrypt functions would align with the IPLD library in that language and map to how that library handles blocks, which is NOT formalized and there are big differences between implementations.

I get what you're saying, although I wonder if we should be disambiguating a bit in this doc by saying what implementers of the codec MUST do (e.g. follow the encoding/decoding rules for the encrypted block format) and what users of the codec SHOULD do (encrypt data of the form presented here, i.e. |
||
|
||
The decrypted payload has a defined format so that it can be parsed into a pair of `CID` and | ||
`bytes`. | ||
|
||
``` | ||
| uint32(CID byteLength) | CID | Bytes | ||
``` | ||
|
||
The decrypted state is decoded into the IPLD data model according to the following schema. | ||
|
||
```ipldsch | ||
type DecryptedBlock struct { | ||
cid Link | ||
bytes Bytes | ||
} representation map | ||
``` | ||
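The decrypted payload layout could be packed and parsed along these lines. This is a hedged sketch: the spec text above does not state the byte order of the `uint32` length prefix, so big-endian is an assumption here, as are the function names and the treatment of the CID as its raw binary form.

```python
import struct
from typing import Tuple

# ASSUMPTION: the uint32 CID length is big-endian; the spec does not say.
LENGTH_FORMAT = ">I"
LENGTH_SIZE = struct.calcsize(LENGTH_FORMAT)  # 4 bytes


def pack_decrypted(cid: bytes, data: bytes) -> bytes:
    """Build | uint32(CID byteLength) | CID | Bytes | from a binary CID and block bytes."""
    return struct.pack(LENGTH_FORMAT, len(cid)) + cid + data


def unpack_decrypted(payload: bytes) -> Tuple[bytes, bytes]:
    """Parse a decrypted payload back into (cid, bytes)."""
    if len(payload) < LENGTH_SIZE:
        raise ValueError("payload is too short to hold the CID length prefix")
    (cid_length,) = struct.unpack_from(LENGTH_FORMAT, payload, 0)
    end = LENGTH_SIZE + cid_length
    if len(payload) < end:
        raise ValueError("payload is shorter than the declared CID length")
    return payload[LENGTH_SIZE:end], payload[end:]
```

The length prefix is needed because CIDs vary in size, so the boundary between the CID and the block bytes cannot be inferred from the codec alone.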
|
||
Grabbing this line to start a thread in GitHub so as not to clog up the PR if people aren't interested. I know it's a bit long, but the Slack thread that generated this was much longer 😄.

I've mentioned this offline, but am adding my objections to these codecs here. I'm posting here instead of in the multicodec repo since, as @rvagg has mentioned a number of times before, in order to keep the code table open we shouldn't be gatekeepers on whether a code is "good enough" to get in the table. Instead we should just nudge people towards design patterns that are likely to work in the ecosystem and figure out what value the codes should get in the table (as well as prevent squatting lots of codes).

Summary

Overall, my issue with this proposal is that when people request nominative types in IPLD (or, failing that, in the code table) we have pushed back (and still do) by asserting that this should be handled at the application layer instead of the data layer, except here, where there's a particular built-in exception.

Specifics

aes-gcm and aes-ctr have the same serialized raw bytes and IPLD Data Model format, which means they can have the same codec. Giving additional codes for the same data format is something we regularly push back on. This means we really only have two codecs and yet have defined three.

should be able to cope whether the data is serialized in the format described here, as CBOR, or any other serialization format. However, reaching the same behaviors that the privileged aes-* codecs get would require reserving another code in the code table (e.g. aes-gcm-cbor), which is both a large hurdle and would eat up table entries, making these the privileged codecs. This spec even asserts that the application layer must be involved to get the encryption/decryption keys, so why not just use normal codecs and put determining if it's aes-{gcm,cbc,ctr} in the application layer?

Alternatives

Implement a scheme based off of one of the proposals for nominative types in IPLD, such as #71. For example, define the AES struct as:

While the type could be represented in any number of ways, we can keep things similar to this proposal by defining aes-{gcm,cbc,ctr} in the code table as abstract definitions of ciphers (as opposed to this particular implementation). If we thought it was worth saving a few bytes on the wire we could even make a custom codec (short-aes-encrypted-data).

So for the cost of <5 bytes per encrypted blob we now have replaceable formats and an example of how people who need nominative types can implement them. If we think this case has such special requirements that it deserves to be special cased then I guess that's the price being paid, but IMO we should at least be explicit about it. For example, are we claiming those bytes are just that valuable? Alternatively, do we think the out-of-band signaling is so specially useful even though the application layer is needed to figure out what keys to use anyhow?

I really like the alternative. In regards to bytes overhead, I don't know enough about type codes, but looking at this PR it seems that sizes are fixed for a specific type code, so you would only need 1 byte per type code and infer the size from it (which isn't very self-describing, so I'm not saying we should do this, but I think we could if needed).

I don't want to derail the conversation too much further, but there were some lessons learned in this work: https://github.com/ipld/specs/blob/5b8f87ffa942f0ce30d53f799d49da1814f1273d/block-layer/codecs/dag-jose.md. Certainly, that is higher level than what is proposed here, but we might be able to borrow some concepts from: https://tools.ietf.org/html/rfc7518, at least in terms of @type definitions?
We probably need to revise this language quite a bit. Our specs lean heavily toward describing how IPLD works with generic codecs that support the full data model. They do a poor job of describing codecs like this and codecs like bitcoin, git, etc.: codecs that have a data model representation but do not serialize arbitrary data model values.

I think it still holds true. Codecs still describe conversions from and to the Data Model, even if it isn't the full Data Model and doesn't even support arbitrary things.

We don’t have language that excludes these sorts of codecs, but we were just more concerned with full data model codecs when we wrote a lot of these specs and so the language is often biased in that direction.
I think using `representation tuple` would be clearer. I know that a codec can decide how to encode a Data Model Map, but just concatenating two byte streams reminds me conceptually more of an array/tuple than of a map. Same for the `DecryptBlock`.

This describes what the parsed data model state is, not the encoded binary. It should be a map, not a tuple, because even without a schema being applied it’s a map.

I had to think about that quite a bit, now I get it. It makes sense. Then this codec only supports `Bytes` and `Map`s (so replace `List` with `Map` in my other comment).