Skip to content
This repository has been archived by the owner on Jun 29, 2022. It is now read-only.

codec: AES encrypted blocks #349

Open
wants to merge 6 commits into
base: master
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
83 changes: 83 additions & 0 deletions block-layer/codecs/encrypted.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,83 @@
# Specification: Encrypted Block Codec

**Status: Prescriptive - Draft**

This document describes codecs for IPLD blocks (CID + Data) that are encrypted. The
multicodec idenfier for the cipher and the initital vector are included in the block
format and parsed into a standardized data model representation.

The following known ciphers may be referenced in encrypted blocks:

| name | multicodec |
| --- | --- |
| aes-gcm | 0x1401 |
| aes-cbc | 0x1402 |
| aes-ctr | 0x1403 |

## What this spec is not

This is not a complete system for application privacy. The following issues are
out of scope for this specification, although they can obviously leverage these codecs:

* Key signaling
* Access controls
* Dual-layer encryption w/ replication keys

How you determine what key to apply to an encrypted block will need to be done in the
application layer. How you decide to encrypt a graph and potentially link the encrypted
blocks together for replication is done at the application layer. How you manage and access
keys is done in the application layer.

## Encode/Decode vs Encrypt/Decrypt

The goal of specifying codecs that are used for encryption is to allow the codecs to
include encryption and decryption programs alongside the codec. However, encrypting and
decrypting are done by the user and are not done automatically as part of any encode/decode
operation in the codec.

The encryption program returns a data model value suitable for the block encode program. The
decode program provides a data model value suitable for the decryption program. And the decryption
program provides a data model value suitable for parsing into a new block (CID and Bytes). These
programs are designed to interoperate but it's up to the user to combine them and provide the
necessary key during encryption and decryption.

## Encrypted Block Format

An encrypted block can be decoded into its initializing vector and the encrypted byte
payload. Since the initializing vector is a known size for each AES variant the block
format is simply the iv followed by the byte payload.

```
| varint(cipher-multicodec) | varint(ivLength) | iv | bytes |

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not that any justification is needed to propose a new block format, but what is the context for using this over cbor?

Is it because we think encryption is so useful and common that shaving the bytes off the field name is worthwhile? Is it because we think this format is so much simpler to implement than cbor that it can be used across more of the ecosystem? Something else?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The most obvious reason is that it’s substantially smaller than CBOR. There’s no reason for us to leverage large general purpose block formats when jamming a few buffers and/or varints together will work.

We should never assume that CBOR or DagCBOR are already available. These formats are not that widely used outside of our bubble and are nowhere near the adoption level that I’d consider we can “take for granted” like we might JSON.

When we leverage an existing codec we then also have to do schema validation on the input and output in order to ensure determinism since the generic codec will allow any number of generic properties to be there in addition to what we’ve defined in the spec.

If you can write a codec in a few lines of code for a new data structure that is already its own multicodec identifier we shouldn’t shy away from doing just that. We avoid all the caveats and concerns about determinism when we write self-describing types as an ordered concatenation of bytes.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, I figured 👍. The one caveat I have is:

If you can write a codec in a few lines of code for a new data structure that is already its own multicodec identifier we shouldn’t shy away from doing just that.

I think we should still shy away from it since I'd prefer the number of codecs to be smaller rather than bigger as it puts pressure on generalized systems like IPFS to get bigger and bigger as they need to add support for more and more codecs. They also take up slots in the code table and arguably any block format could want a small code number because people could be storing tons of blocks and could then easily save some space.

Copy link
Contributor

@warpfork warpfork Jan 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a good question. "why would we prefer x over cbor?" is almost always a valid question (so much so that we should almost put it in a checklist for proposals!), and this context is no different.

In this case: I could see users remarking on the bytes shaved off as useful, yes. And the simplicity also does seem potentially relevant: that this format will never, ever need to be capable of any kind of recursion, since it's just a small wrapper for the ciphertext, does make it possible to implement with considerably simpler mechanisms than cbor.

I don't think either of those arguments is open-and-shut, but they're enough to make me receptive to a new block format, at least.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is an argument about using a custom block format because of it's size compared to e.g. DAG-CBOR.

The current proposal is | varint(cipher-multicodec) | varint(ivLength) | iv | bytes | with the currently choosen multicodecs that's 3 bytes overhead compared to the actual data.

As a custom format needs custom code anyway, you could just hard-code the lengths of the iv based on the cipher and use an enum (and not a multicodec for the cipher type), so you end up with 1 byte overhead.

If you would use DAG-CBOR, when using an enum (and not multicodecs) for the ciphers and using list representation the overhead would be 7 bytes.

I'd go for either the 1 byte overhead or codec independent (e.g. DAG-CBOR), but I don't have a strong opinion.

Copy link

@aschmahmann aschmahmann Jan 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An interesting point about needing custom code anyway. Although it depends on where in the stack this custom code lives. If a new cipher is added you'd now have to change the codec code which is a little awkward.

For example, if we added support for this codec in go-ipfs but had not added support for a new cipher. With the extra bytes some python or js client could happily fetch the data from go-ipfs and decrypt with their custom cipher without any problems, by removing those bytes now the user needs to send a PR to go-ipfs to modify the codec to support their cipher.

```

This is decoded into IPLD data model of the following schema.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you put a title in here, or a just inline the wording that says that this is the "Logical Format"? We use that terminology in the DAG-PB spec to make it distinct from the wire format and I think it's a really good framing for these schemas that talk about how we instantiate data forms out of a soup of bytes that could be interpreted in any number of ways.


```ipldsch
type AES struct {
code Int

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps I would've chosen to try experimenting with @type instead of code to see how it feels since it might encourage library development that takes nominative typing into account (e.g. #71 uses @type: codecNumber as an example).

However, I definitely understand just wanting to roll something out for encryption without worrying about longer term ramifications or bikeshedding on what @type could look like.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code has been how we typically refer to multicodecs in a variety of locations. in this case, these all correspond to multicodecs so it makes sense.

this representation ends up being kind of important in js-multiformats because it consistently destructures to the correct property name across the system

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems very potäto/potato to me. I sincerely doubt making a field called "@type" instead of "code" here is going to influence anyone's hypothetical future library development in any clear direction.

Also: "@type" would not be syntactically valid IPLD Schema syntax: symbols are not permitted in field names. And for this, the reason in turn is: suppose someone wants to feed this into codegenerators: that "@" isn't going to be a valid field name in most programming languages under the sun, either.

Copy link

@aschmahmann aschmahmann Jan 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm likely missing lots of context on js-multiformats but

because it consistently destructures to the correct property name across the system

is the point I was getting at with @type, the more structures adopt a convention like that this the easier it becomes to reuse it in the future. (e.g. if @type defined could be defined as a union of { Int | String | CID ... } similar to the proposal ).

That block format may literally just be dag-cbor, I don’t think it’s worth producing formal rules about format re-use.

(from @mikeal's comment #349 (comment))

I disagree, I think separating the block layer from the application layer entirely is super useful. That means minimizing special semantics associated with codec names as much as possible such that they are only used to decode the bytes and get IPLD Data Model output.

Also: @type would not be syntactically valid IPLD Schema syntax: symbols are not permitted in field names. And for this, the reason in turn is: suppose someone wants to feed this into codegenerators: that "@" isn't going to be a valid field name in most programming languages under the sun, either.

That's interesting and seems to push even more towards building libraries where a field like @type is considered special since it's a valid data model field name but not a valid schema name, and we've been pushing people towards describing their data with schemas. i.e. doing this in application space would be a pain since it's not valid schema syntax and it is a generically useful feature -> do it below the application space?

btw I know we've gone at this a bunch already recently and while I'm happy to continue, consider my comment "However, I definitely understand just wanting to roll something out for encryption..." license to just say "I disagree, but we can go at this another time" 😄

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’ll try to clarify.

In js-multiformats there are tools for defining and working with codecs, hashers, base encoders and now encryption/decryption programs.

In all cases the “lookup” for a given implementation is done with an integer code. Our implementations of CID and multihash include a code property that matches a code property on codec and hash implementations. Looking up a given implementation is easy and consistent since you can do const lookup = ({ code }) => { /* return implementation */ } and have it work consistently across all multiformats.

That’s why it’s so useful to have the data model representation use a code property because we can then lookup the decryption program in multiformats by just passing the return value from decode() directly to a lookup function that already works with CIDs and multihashes.

We aligned all of these in the last major refactor when we moved to integer codes in order to avoid needing to ship every codec and base encoder, and the multiformats table, in order to support string names for codecs and hashers.

iv Bytes
bytes Bytes
} representation map
```

The `code` property can be used to looking the decryption program in order to arrive
at the decrypted block format below.

## Decrypted Block Format

The decrypted payload has a defined format so that it can be parsed into a pair of `CID` and
`bytes`.

```
| CID | Bytes
```

The decrypted state is decoded into IPLD data model of the following schema.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment re "Logical Format"


```ipldsch
type DecryptedBlock struct {
cid Link
bytes Bytes
} representation map
```