codec: AES encrypted blocks #349
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,83 @@ | ||
# Specification: Encrypted Block Codec | ||
|
||
**Status: Prescriptive - Draft** | ||
|
||
This document describes codecs for IPLD blocks (CID + Data) that are encrypted. The | ||
multicodec identifier for the cipher and the initialization vector are included in the block | ||
format and parsed into a standardized data model representation. | ||
|
||
The following known ciphers may be referenced in encrypted blocks: | ||
|
||
| name | multicodec | | ||
| --- | --- | | ||
| aes-gcm | 0x1401 | | ||
| aes-cbc | 0x1402 | | ||
| aes-ctr | 0x1403 | | ||
|
||
## What this spec is not | ||
|
||
This is not a complete system for application privacy. The following issues are | ||
out of scope for this specification, although they can obviously leverage these codecs: | ||
|
||
* Key signaling | ||
* Access controls | ||
* Dual-layer encryption w/ replication keys | ||
|
||
How you determine what key to apply to an encrypted block will need to be done in the | ||
application layer. How you decide to encrypt a graph and potentially link the encrypted | ||
blocks together for replication is done at the application layer. How you manage and access | ||
keys is done in the application layer. | ||
|
||
## Encode/Decode vs Encrypt/Decrypt | ||
|
||
The goal of specifying codecs that are used for encryption is to allow the codecs to | ||
include encryption and decryption programs alongside the codec. However, encrypting and | ||
decrypting are done by the user and are not done automatically as part of any encode/decode | ||
operation in the codec. | ||
|
||
The encryption program returns a data model value suitable for the block encode program. The | ||
decode program provides a data model value suitable for the decryption program. And the decryption | ||
program provides a data model value suitable for parsing into a new block (CID and Bytes). These | ||
programs are designed to interoperate but it's up to the user to combine them and provide the | ||
necessary key during encryption and decryption. | ||
|
||
## Encrypted Block Format | ||
|
||
An encrypted block can be decoded into its initialization vector and the encrypted byte | ||
payload. Since the initialization vector is a known size for each AES variant, the block | ||
format is simply the iv followed by the byte payload. | ||
|
||
``` | ||
| varint(cipher-multicodec) | varint(ivLength) | iv | bytes | | ||
There is an argument about using a custom block format because of its size compared to e.g. DAG-CBOR. The current proposal is As a custom format needs custom code anyway, you could just hard-code the lengths of the If you would use DAG-CBOR, when using an enum (and not multicodecs) for the ciphers and using list representation the overhead would be 7 bytes. I'd go for either the 1 byte overhead or codec independent (e.g. DAG-CBOR), but I don't have a strong opinion.

An interesting point about needing custom code anyway. Although it depends on where in the stack this custom code lives. If a new cipher is added you'd now have to change the codec code, which is a little awkward. For example, suppose we added support for this codec in go-ipfs but had not added support for a new cipher. With the extra bytes, some Python or JS client could happily fetch the data from go-ipfs and decrypt with their custom cipher without any problems. By removing those bytes, the user now needs to send a PR to go-ipfs to modify the codec to support their cipher. |
||
``` | ||
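The layout above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes the varints are multiformats-style unsigned varints (7 bits per byte, least significant group first), and the names `encode_varint`, `decode_varint`, `encode_block` and `decode_block` are hypothetical helpers introduced here. The decoded value is a plain dict with the `code`, `iv` and `bytes` keys.

```python
from typing import Tuple


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as an unsigned varint (assumed LEB128 style)."""
    if value < 0:
        raise ValueError("varints are unsigned")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data: bytes, offset: int = 0) -> Tuple[int, int]:
    """Return (value, next_offset) for the varint starting at offset."""
    value = 0
    shift = 0
    while True:
        if offset >= len(data):
            raise ValueError("truncated varint")
        byte = data[offset]
        offset += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, offset
        shift += 7


def encode_block(node: dict) -> bytes:
    """Serialize {code, iv, bytes} as varint(code) | varint(len(iv)) | iv | bytes."""
    iv = node["iv"]
    return encode_varint(node["code"]) + encode_varint(len(iv)) + iv + node["bytes"]


def decode_block(data: bytes) -> dict:
    """Parse the wire format back into a {code, iv, bytes} map."""
    code, offset = decode_varint(data)
    iv_length, offset = decode_varint(data, offset)
    iv_end = offset + iv_length
    if iv_end > len(data):
        raise ValueError("iv length exceeds block size")
    return {"code": code, "iv": data[offset:iv_end], "bytes": data[iv_end:]}
```

Neither function touches a key or performs any cryptography; they only move between the byte layout and the data model value, which matches the encode/decode versus encrypt/decrypt split described earlier.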
|
||
This is decoded into the IPLD data model with the following schema. | ||
Can you put a title in here, or just inline the wording that says that this is the "Logical Format"? We use that terminology in the DAG-PB spec to make it distinct from the wire format and I think it's a really good framing for these schemas that talk about how we instantiate data forms out of a soup of bytes that could be interpreted in any number of ways. |
||
|
||
```ipldsch | ||
type AES struct { | ||
code Int | ||
Perhaps I would've chosen to try experimenting with However, I definitely understand just wanting to roll something out for encryption without worrying about longer term ramifications or bikeshedding on what

this representation ends up being kind of important in

This seems very potäto/potato to me. I sincerely doubt making a field called "@type" instead of "code" here is going to influence anyone's hypothetical future library development in any clear direction. Also: "@type" would not be syntactically valid IPLD Schema syntax: symbols are not permitted in field names. And for this, the reason in turn is: suppose someone wants to feed this into codegenerators: that "@" isn't going to be a valid field name in most programming languages under the sun, either.

I'm likely missing lots of context on
is the point I was getting at with
(from @mikeal's comment #349 (comment)) I disagree, I think separating the block layer from the application layer entirely is super useful. That means minimizing special semantics associated with codec names as much as possible such that they are only used to decode the bytes and get IPLD Data Model output.
That's interesting and seems to push even more towards building libraries where a field like btw I know we've gone at this a bunch already recently and while I'm happy to continue, consider my comment "However, I definitely understand just wanting to roll something out for encryption..." license to just say "I disagree, but we can go at this another time" 😄

I’ll try to clarify. In In all cases the “lookup” for a given implementation is done with an integer code. Our implementations of CID and multihash include a That’s why it’s so useful to have the data model representation use a We aligned all of these in the last major refactor when we moved to integer codes in order to avoid needing to ship every codec and base encoder, and the multiformats table, in order to support string names for codecs and hashers. |
||
iv Bytes | ||
bytes Bytes | ||
} representation map | ||
``` | ||
|
||
The `code` property can be used to look up the decryption program in order to arrive | ||
at the decrypted block format below. | ||
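One way to picture the lookup is a table keyed by the integer `code`. The sketch below is an assumption about how an application might wire this up, not something the spec prescribes: `register_decryptor`, `decrypt_block` and the `(key, iv, ciphertext) -> plaintext` signature are hypothetical, and no real cipher is included. The cipher codes are the ones listed in the table at the top of this document.

```python
from typing import Callable, Dict

# Cipher codes as listed in this draft.
CIPHER_NAMES = {0x1401: "aes-gcm", 0x1402: "aes-cbc", 0x1403: "aes-ctr"}

# Hypothetical signature for a decryption program: (key, iv, ciphertext) -> plaintext.
Decryptor = Callable[[bytes, bytes, bytes], bytes]

_decryptors: Dict[int, Decryptor] = {}


def register_decryptor(code: int, fn: Decryptor) -> None:
    """Associate a decryption program with an integer cipher code."""
    _decryptors[code] = fn


def decrypt_block(node: dict, key: bytes) -> bytes:
    """Look up the decryption program by node['code'] and apply it.

    `node` is the decoded {code, iv, bytes} map; the caller supplies the key.
    """
    code = node["code"]
    try:
        fn = _decryptors[code]
    except KeyError:
        name = CIPHER_NAMES.get(code, "unknown")
        raise LookupError(
            f"no decryption program registered for code {code:#x} ({name})"
        ) from None
    return fn(key, node["iv"], node["bytes"])
```

Because the lookup uses only the integer code, a client can register a decryption program for a cipher that the codec implementation itself knows nothing about.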
|
||
## Decrypted Block Format | ||
|
||
The decrypted payload has a defined format so that it can be parsed into a pair of `CID` and | ||
`bytes`. | ||
|
||
``` | ||
| CID | Bytes | ||
``` | ||
|
||
The decrypted state is decoded into the IPLD data model with the following schema. | ||
same comment re "Logical Format" |
||
|
||
```ipldsch | ||
type DecryptedBlock struct { | ||
cid Link | ||
bytes Bytes | ||
} representation map | ||
``` |
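Splitting the decrypted payload into its `cid` and `bytes` parts only requires finding where the CID ends. The sketch below assumes the standard binary CID layout: CIDv1 is `varint(version) | varint(codec) | multihash`, where the multihash is `varint(hash code) | varint(digest length) | digest`, and CIDv0 is a bare 34-byte sha2-256 multihash starting with `0x12 0x20`. Whether CIDv0 is permitted here is not stated in this draft, so that branch is an assumption. `split_decrypted` is a hypothetical name, and the CID is returned as raw bytes, not as a Link.

```python
from typing import Tuple


def decode_varint(data: bytes, offset: int = 0) -> Tuple[int, int]:
    """Return (value, next_offset) for the unsigned varint starting at offset."""
    value = 0
    shift = 0
    while True:
        if offset >= len(data):
            raise ValueError("truncated varint")
        byte = data[offset]
        offset += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, offset
        shift += 7


def split_decrypted(data: bytes) -> dict:
    """Split a decrypted payload laid out as | CID | Bytes | into its two parts."""
    # CIDv0 (assumed allowed): a bare sha2-256 multihash, 2 prefix bytes + 32-byte digest.
    if len(data) >= 34 and data[0] == 0x12 and data[1] == 0x20:
        return {"cid": data[:34], "bytes": data[34:]}

    version, offset = decode_varint(data)
    if version != 1:
        raise ValueError(f"unsupported CID version {version}")
    _codec, offset = decode_varint(data, offset)        # content codec
    _hash_code, offset = decode_varint(data, offset)    # multihash function code
    digest_length, offset = decode_varint(data, offset)
    cid_end = offset + digest_length
    if cid_end > len(data):
        raise ValueError("multihash digest exceeds payload size")
    return {"cid": data[:cid_end], "bytes": data[cid_end:]}
```

The CID needs no separate length prefix because it is self-delimiting; everything after it is the block's byte payload.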
Not that any justification is needed to propose a new block format, but what is the context for using this over cbor?
Is it because we think encryption is so useful and common that shaving the bytes off the field name is worthwhile? Is it because we think this format is so much simpler to implement than cbor that it can be used across more of the ecosystem? Something else?
The most obvious reason is that it’s substantially smaller than CBOR. There’s no reason for us to leverage large general purpose block formats when jamming a few buffers and/or varints together will work.
We should never assume that CBOR or DagCBOR are already available. These formats are not that widely used outside of our bubble and are nowhere near the adoption level that I’d consider we can “take for granted” like we might JSON.
When we leverage an existing codec we then also have to do schema validation on the input and output in order to ensure determinism since the generic codec will allow any number of generic properties to be there in addition to what we’ve defined in the spec.
If you can write a codec in a few lines of code for a new data structure that is already its own multicodec identifier we shouldn’t shy away from doing just that. We avoid all the caveats and concerns about determinism when we write self-describing types as an ordered concatenation of bytes.
yep, I figured 👍. The one caveat I have is:
I think we should still shy away from it since I'd prefer the number of codecs to be smaller rather than bigger as it puts pressure on generalized systems like IPFS to get bigger and bigger as they need to add support for more and more codecs. They also take up slots in the code table and arguably any block format could want a small code number because people could be storing tons of blocks and could then easily save some space.
I think this is a good question. "why would we prefer x over cbor?" is almost always a valid question (so much so that we should almost put it in a checklist for proposals!), and this context is no different.
In this case: I could see users remarking on the bytes shaved off as useful, yes. And the simplicity also does seem potentially relevant: that this format will never, ever need to be capable of any kind of recursion, since it's just a small wrapper for the ciphertext, does make it possible to implement with considerably simpler mechanisms than cbor.
I don't think either of those arguments is open-and-shut, but they're enough to make me receptive to a new block format, at least.