Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MSC3202: Encrypted appservices #3202

Open
wants to merge 13 commits into
base: old_master
Choose a base branch
from
158 changes: 158 additions & 0 deletions proposals/3202-encrypted-appservices.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,158 @@
# MSC3202: Encrypted Appservices

Presently, appservices in Matrix are capable of attaching themselves to a homeserver for high-traffic
bot-like usecases, such as bridging and operationally expensive bots. Traditionally, these appservices
only work in unencrypted rooms due to not having enough context on the encryption state to actually
function properly.

This MSC targets the missing bits to support encryption at the appservice level: other MSCs, such as
[MSC2409](https://github.com/matrix-org/matrix-doc/pull/2409) and [MSC2778](https://github.com/matrix-org/matrix-doc/pull/2778)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
[MSC2409](https://github.com/matrix-org/matrix-doc/pull/2409) and [MSC2778](https://github.com/matrix-org/matrix-doc/pull/2778)
[MSC4203](https://github.com/matrix-org/matrix-doc/pull/4203) and [MSC2778](https://github.com/matrix-org/matrix-doc/pull/2778)

give appservices foundational pieces to get device IDs and to-device messages, as required by encryption.

## Proposal

This proposal takes inspiration from [MSC2409](https://github.com/matrix-org/matrix-doc/pull/2409) by
defining a new set of keys on the appservice `/transactions` endpoint, similar to sync:

```json5
{
"events": [
// as defined today
],
"ephemeral": [
// MSC2409
],
"to_device": [
// MSC2409
],
"device_lists": {
"changed": ["@alice:example.org"],
"left": ["@bob:example.com"]
},
"device_one_time_keys_count": {
turt2live marked this conversation as resolved.
Show resolved Hide resolved
"@_irc_bob:example.org": {
turt2live marked this conversation as resolved.
Show resolved Hide resolved
"DEVICEID": {
"curve25519": 10,
"signed_curve25519": 20
}
}
},
"device_unused_fallback_key_types": {
"@_irc_bob:example.org": {
"DEVICEID": ["signed_curve25519"]
}
}
}
```

These fields are heavily inspired by [the extensions to /sync](https://matrix.org/docs/spec/client_server/r0.6.1#id84)
in the client-server API.

All the new fields can be omitted if there are no changes for the appservice to handle. For
`device_one_time_keys_count` and `device_unused_fallback_key_types`, the format is slightly different
from the client-server API to better map the appservice's user namespace users to the counts. Users
in the namespace without keys or which have unchanged keys since the last transaction can be omitted
(more details on this later on). Note that fallback keys are described in
Comment on lines +53 to +55
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How should the AS distinguish between users without keys and users with unchanged keys if you can omit them for both cases?

In my implementation thus far I've been sending empty dicts for the 'without keys' case. I'm now reading that this is unnecessary*, but would be interested in the answer to the above.

*I'm especially sceptical of 'no unused fallback keys' and 'no change in your unused fallback keys' having the same representation over the wire (and thus being indistinguishable). I personally think it's 'wrong' :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MSC2732 has this to say about including the field in /sync: The device_unused_fallback_key_types parameter must be present if the server supports fallback keys. Clients can thus treat this field as an indication that the server supports fallback keys, and so only upload fallback keys to servers that support them.

My personal opinion here is that the per-user entries should be omissible if and only if there have been no changes for that user.
If we wanted to require an indication that the homeserver supports fallback keys, we'd only need to require that the outer dict be present (rather than all the users' entries) — don't know if it makes sense to require this though?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agh, after reading my code, I notice that I also include empty lists for all the devices if there are no unused fallback keys. I think this is in keeping with what /sync does, but again, it would be nice to have this commented on specifically by the MSC so I know whether what I'm doing is right or wrong before it gets merged.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In theory the appservice is doing its own state tracking for whether or not it has uploaded keys (of any variety). If for some reason it loses this state, it can do two things to recover:

  1. Query for its own device keys, uploading new ones if none exist. If device keys do exist and the appservice doesn't know what they are, it can pick/create a new device for that user and upload fresh keys to that device.
  2. Upload a new fallback key, even if unused.

So it should be perfectly safe to exclude fields in both cases, though the MSC is also assuming the server is working off a transaction model under the hood and over the wire: it's a lot easier to see that the appservice was already informed of a change when there's a transaction log to say that was sent.

Whether the server supports fallback keys or not is somewhat irrelevant for appservices: if an appservice requires it, it can just mark it in documentation for server compatibility. Plus, with the new Matrix versioning, the appservice can declare a minimum supported spec version that operators can compare against their homeserver installs. Clients are a bit different because it's a lot easier to accidentally/on purpose use a modern client against an old server, so the feature check becomes a bit more important (at least while fallback keys weren't in a spec release).

Copy link
Contributor

@reivilibre reivilibre Feb 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@turt2live Mostly reasonable, but I'm left with the question of: how is the appservice meant to distinguish between 'no difference in your fallback keys' and 'all your fallback keys have been consumed' if both of them are represented by omission?

My suggestion for how this should work would be: if a device entry is specified, then the list (UFBKs)/dict (OTK counts) is the whole new state for that device and replaces the AS's current state for that device. If a device entry is omitted, then there's no change.
In this system, 'all your fallback keys have been consumed' would be represented by the empty list [] of fallback keys. Thereafter it could theoretically be omitted if the state continues to be 'no keys'.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The state change of being unused to used would be in a transaction, but further transactions don't need to include it. This is represented by [] already.

Copy link
Contributor

@reivilibre reivilibre Feb 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The state change of being unused to used would be in a transaction, but further transactions don't need to include it. This is represented by [] already.

Perfect; I think the wording of the '...users without keys can be omitted...' should be altered to clarity that

[MSC2732](https://github.com/matrix-org/matrix-doc/pull/2732) as of writing.

Like MSC2409, any user the appservice would be considered "interested" in (user in the appservice's
namespace, or sharing a room with an appservice user/namespaced room) would qualify for the device
list changes section.
turt2live marked this conversation as resolved.
Show resolved Hide resolved

Note that it's typical for clients to pause sync loops when processing device list changes to avoid
a scenario where they are unable to decrypt/encrypt a message from/to a particular device. Appservices
are expected to mirror this by ensuring the transaction request does not complete until processing
is complete. In the worst case, the server will time out the request and retry it verbatim, so
appservices might wish to track which device list changes in which transaction they already processed
or keep processing transactions in the background while retries are attempted.

In order to allow the appservice to masquerade as its users, an extension to the existing
[identity assertion](https://matrix.org/docs/spec/application_service/r0.1.2#identity-assertion)
ability is proposed. To compliment the (optional) `user_id` when using an `as_token` as an access
token, a similarly optional `device_id` query parameter is proposed. When provided, the server asserts
that the device ID is valid for the user, and that the appservice is able to masquerade as that user.
Comment on lines +72 to +73
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what should be the HTTP and matrix-level error code if it's not valid?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for lack of a better answer, let's reuse 400 M_EXCLUSIVE for now. We'll probably want to introduce an error code in the future.

If valid, that device ID should be assumed as being used for that request. For many requests, this
means updating the "last seen IP" and "last seen timestamp" for the device, however for some endpoints
it means interacting with that device (such as when uploading keys).

### Optimization: when to send OTKs/fallback keys

As mentioned above, in order to keep the transaction byte size down the server can (and should) exclude
OTK counts and unused fallback keys when they haven't changed since the last transaction. Appservices
however should be tolerable of the server over-communicating the counts as a quick/cheap approach would
be to *always* include the OTK counts/unused fallback keys for all known users rather than trying to
detect changes.

As a middle ground, servers might be interested in an algorithm which doesn't detect changes between
transactions but does attempt to reduce traffic. If the appservice is about to receive an event or
message typically associated with encryption, the counts for the affected users could be included. This
would result in the following rules:
* If an `m.room.encrypted` event is being included in the transaction's `events`, include OTK counts and
unused fallback key types for all appservice users which reside in that room.
* If an appservice user is receiving a to-device message in the transaction's `to_device` array, include
OTK counts and unused fallback key types for that user.

This approach has the advantage of typically minimal changes to the internals of the homeserver, works
similar to `/sync`, and reduces noisy traffic in the transaction sending. This additionally still honours
the "when they change, send the counts" requirement to a reasonable degree: typically a use of an OTK will
be followed by a to-device message. It is theoretically possible for an appservice to run out of OTKs if
a remote user claims all OTKs without actually using them. Implementations may be interested in
[MSC2732: Fallback keys](https://github.com/matrix-org/matrix-doc/pull/2732) which will avoid a scenario
where the appservice can no longer decrypt messages.

However, as mentioned, servers are free to include this information as little or often as they'd like,
provided they send it at least as often as when it changes.

### Optimization: Don't encrypt as often

Appservices theoretically do not need to establish Olm sessions with other appservice users as the appservice
will typically be managing the devices in one place. In short, this means that a room with 10k appservice
users and only 1 non-appservice user can be sped up by only encrypting from the appservice's users to the
non-appservice user. The appservice would not need to set up 10k * 10k Olm sessions given the encryption
and decryption all happens in the same place. As an added bonus, this improves performance of the appservice
as it doesn't have to handle to-device messages sent to itself.

Some implementations might not be able to support this sort of optimization though, so it is still permitted
to establish sessions and such between appservice users.

## Potential issues

Servers would have to track and send this information to appservices, however this is still perceived
to be more performant than appservices using potentionally thousands of `/sync` streams.

Appservices additionally cannot opt-in (or out) of this functionality unlike with MSC2409. It is
expected that servers will optimize for not including/calculating the fields if the appservice has
no interest in the information. Specifically, appservices which don't have any keys under their user
namespace can be assumed to not need device list changes and thus can be optimized out.

## Alternatives

An endpoint for appservices to poll could work, though this is extra work for the appservice and would
likely need pagination and such, which is all heavyweight for the server. Instead, having the server
batch up updates and send them to the appservice is likely faster.

## Security considerations

None relevant - this is the same information the appservice would get if it spawned `/sync` streams for
all the users in its namespace.

## Unstable prefix

While this MSC is not considered stable for implementation, implementations should use `org.matrix.msc3202.`
as a prefix to the fields on the `/transactions` endpoint. For example:
* `device_lists` becomes `org.matrix.msc3202.device_lists`
* `device_one_time_keys_count` becomes `org.matrix.msc3202.device_one_time_keys_count`
* `device_unused_fallback_key_types` becomes `org.matrix.msc3202.device_unused_fallback_key_types`

Appservices which support encryption but never see these fields (ie: server is not implementing this in an
unstable capacity) should be fine, though encryption might not function properly for them. It is the
responsibility of the appservice to try and recover safely and sanely, if desired, when the server is not
implementing this in an unstable capacity. This is not a concern once the MSC becomes stable in a released
version of the specification, as servers will be required to implement it.

For servers wishing to force appservices to opt-in to this behaviour, they may use `org.matrix.msc3202: true`
in the registration file. Servers will be able to check for "opt-in" behaviour once this MSC is stable by
seeing whether or not the appservice has an encryption-capable device recorded in its users namespaces.

To use device ID masquerading, implementations should use `org.matrix.msc3202.device_id` instead of `device_id`
in the query string while this MSC is considered unstable.