Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MSC2162: Signaling Errors at Bridges #2162

Open
wants to merge 17 commits into
base: old_master
Choose a base branch
from
125 changes: 125 additions & 0 deletions proposals/2162-signaling-errors-at-bridges.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,125 @@
# Signaling Errors at Bridges

Sometimes bridges just silently swallow messages and other events. This proposal
enables bridges to communicate that something went wrong and gives clients the
option to give feedback to their users.

## Proposal
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The proposal does not cover how bridges de-flag errors (eventual success in sending a message). I am assuming they redact their original error event.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this proposal is only handling the case of when the bridge gives up trying to send.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that sounds sub-par tbh. We'd need a retry mechanism so that users aren't left stranded, or at the very least support redaction as a way to indicate clearing of the error.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is a good chance that a message will eventually be delivered, I don't think it belongs in this proposal. We should try to unify that case with a general “delivery delay notification” solution for the whole Matrix universe so the work has to be done only once. I am currently writing a bit about what I have in mind about those “delivery delay notifications” and there can discussion about that as well. (Also not quite sure where to have it then.) In the case of a message not being delivered with a high probability and just backing off in rare circumstances, redacting a permanent error might be adequate.

Until now I assumed the error is final and there is no retry, just a manual resend of the message. Could a bridge get the redaction and refetch the original event? Or might it be possible to simulate a resend with a no-op edit? If there is no satisfying way already, one could of course add another event type which is ignored by everyone but the bridge.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Homeservers are expected to keep retrying the appservice until it comes back alive, but that can easily be hours or even days before the service responds. Most bridges nowadays have a condition for received messages where ti just drops messages which are too old, but between the time the bridge went down and the time the message was ignored the user's message was not delivered without notification.

Limiting the scope of the proposal to just fatal errors doesn't really help with communicating the bridge's status because there's many more temporary failures that people expect to hear about due to the nature of realtime communications. It's bad enough we already get complaints when it takes more than 10 seconds to send a message through 4 different points of failure to the remote network.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am coming round to the idea of sending a temporary failure PDU for things which have failed to send and are in a retry queue of some kind. Redacting that would imply it's been sent.

Separately there is a question of if this proposal should cover how the user can indicate they want to retry a message.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hopefully coming around enough to give the OK 😇

I'd be uncomfortable with this going into the spec if it only communicated permanent failures, because permanent failures are rare.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The proposal doesn't explain who sends the error event. Is it the sender_localpart? a user in the namespace? Both of these options have consequences which make them non-ideal: the sender_localpart might not be in the room in the case of a puppet bridge and a user in the namespace is an undefined thing: anyone can claim they are in a namespace.

Copy link
Author

@V02460 V02460 Jul 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The thing we want here is that “the bridge” does send the message, which is not a concept that maps straight to Matrix afaik. Instead we always need a proxy user for the bridge. There are two parts to get this right: Who is eligible to represent the bridge and how to make sure this info came from the bridge? This maps to the problems of authorization and authentication. @Half-Shot mentioned he had a proposal for this via the room state, so it might be a good idea to piggyback on that. (If it is a proposal I assume there is nothing else usable for us out there.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm failing to see the correlation between this and authorization for bridges (I also don't know what proposal that is). Bridges have a namespace of users and a sender_localpart: which user is supposed to send it?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One user representing the bridge does send the message. It depends on who is in the room, so the answer is both of them.

We can't simply say the bridge bot user as it is sometimes not joined e.g. in 1:1 conversations. Then the virtual user of your communication partner does represent the bridge and it should send the bridge error.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This causes problems (as does the regex later on) because clients won't be able to do sanity checking on errors. They don't have a concept of bridges or appservices, and would be unable to see that @alice:example.org doesn't have appropriate permissions over the Freenode IRC bridge.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The proposal should cover how much we care about random users impersonating bridges or bridges lying about their namespaces, and how we protect against that if we do care (we should).


Bridges might come into a situation where there is nothing more they can do to
successfully deliver an event to the foreign network they are connected to. Then
they should be able to inform the originating room of the event about this
delivery error.

### Bridge error event

This document proposes the addition of a new room event with type
`m.bridge_error`. It is sent by the bridge and references an event previously
V02460 marked this conversation as resolved.
Show resolved Hide resolved
sent in the same room, by that marking the original event as “failed to deliver”
for all users of a bridge. The new event type utilizes reference aggregations
([MSC
1849](https://github.com/matrix-org/matrix-doc/blob/matthew/msc1849/proposals/1849-aggregations.md))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heads up that @ara4n is rewriting this proposal, but I don't think it will affect this one https://github.com/matrix-org/matrix-doc/pull/2154/files

to establish the relation to the event its delivery it is marking as failed.
There is no need for a new endpoint as the existing `/send` endpoint will be
utilized.

Additional information contained in the event are the name of the bridged
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can actually drop this now, when #2346 gets merged :)

network (e.g. “Discord” or “Telegram”) and a regex¹ describing the affected
users (e.g. `@discord_*:example.org`). This regex should be similar to the one
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's a regex, then it should be @discord_.*:example.org.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to jump in and say my preference is to keep it as a simple glob expression, rather than having to define the rules around regexes. I doubt we plan to do more than simple matching?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to define what kind of glob it is then, as we use two types in Matrix. Also if it's not a regex, don't call it a regex.

Being able to reuse the namespace from the registration is a strong argument for supporting proper regex here (which is better defined than globs).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(which is better defined than globs).

FWIW, my impression is the opposite. There are several different syntaxes for regexes. But personally, I don't really care which one is picked, as long as it's well defined.

V02460 marked this conversation as resolved.
Show resolved Hide resolved
any Application Service uses for marking its reserved user namespace. By
providing this information clients can inform their users who in the room was
affected by the error and for which network the error occurred.

There are some common reasons why an error occurred. These are encoded in the
`reason` attribute and can contain the following types:

* `m.event_not_handled` Generic error type for when an event can not be handled
V02460 marked this conversation as resolved.
Show resolved Hide resolved
by the bridge. It is used as a fallback when there is no other more specific
reason.

* `m.event_too_old` A message will – with enough time passed – fall out of its
original context. In this case the bridge might decide that the event is too
old and emit this error.

* `m.foreign_network_error` The bridge was doing its job fine, but the foreign
network permanently refused to handle the event.

* `m.unknown_event` The bridge is not able to handle events of this type.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of the type or format (or both)? A bridge might support text messages, but not HTML, or it might support stickers, but not emotes. The language here implies that a bridge has to support all of m.room.message.

Additionally, with the usual goal of a bridge to be instant messaging, I don't think it's fair to ever raise this error. The event types bridges wouldn't be able to send are the same event types it is unlikely to ever be able to send (server ACLs, room aliases, etc). On the other hand, if the bridge is able to raise this error for subtypes/formats of events it doesn't understand, this could be used to flag that a kick/ban was not carried over.

Can we put some recommendations in here that bridges shouldn't send this error for events outside of its use case (ie: for instant messaging, don't signal errors on ACLs, aliases, and custom event types).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Subtypes are fine, I think I clarified it now in the new version.

Let's assume a bridge that doesn't know about stickers yet. It would be very nice to have a sticker sent in that room trigger an m.unknown_event then.

I am more in favor of whitelisting some events by having the bridge recognize those event and not acting on them at all. Everything that is completely unknown would still trigger this. I don't expect there to be many false positives in practice, but if so, the UI for this type of error could be reduced to something less noisy.


V02460 marked this conversation as resolved.
Show resolved Hide resolved
* `m.bridge_unavailable` The homeserver couldn't reach the bridge.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as a subclass to this which has shown to be problematic in recent weeks: When the homeserver is also dead, the users on other homeservers will see the message as delivered when in fact it is not.

I don't know if it makes total sense here given the traffic concern, but maybe flipping this proposal around for positive reactions to messages when they are delivered? Maybe a new kind of m.receipt for this specific purpose.

or maybe we train the general public that the bridge sending a read receipt is fine?

Presumably these ideas have already been covered, so I'm curious as to what the decisions were that led to it not being used.


* `m.no_permission` The bridge wanted to react to an event, but didn't have
V02460 marked this conversation as resolved.
Show resolved Hide resolved
the permission to do so.

Nothing prevents multiple bridge error events to relate to the same event. This
should be pretty common as a room can be bridged to more than one network at a
time.

The need for this proposal arises from a gap between the Matrix network and
other foreign networks it bridges to. Matrix with its eventual consistency is
unique in having a message delivery guarantee. Because of this property there is
no need in the Matrix network itself to model the failure of message delivery.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arguably there is for some obscure cases. For instance, if you set up a bridge in a read only room, the bridge might not be able to post messages. However, the foreign network likely doesn't support indicating failure so the room is the next best option to flag a potential problem. How it does that starts stepping into very questionable territory (extensible profiles, auth rules which support a limited set of per-user events, etc).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd probably de-scope this for the sake of this MSC landing any time soon.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But if we are de-scoping, the MSC text needs to be updated to reflect that and the edge cases that, like @turt2live mentioned, can still be encountered.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am still not quite sure if I properly get what you are saying, but I think it is now discussed in the Rights management section.

This need only arises for interactions with foreign networks where message
delivery might fail. This proposal extends Matrix to be aware of these error
cases.

Additionally there might be some operational restrictions of bridges which might
make it necessary for them to refrain from handling an event, e.g. when hitting
memory limits. In this case the new event type can be used as well.

This is an example of how the new bridge error might look:

```
{
"type": "m.bridge_error",
"content": {
"network: "Discord",
"affected_users": "@discord_*:example.org",
V02460 marked this conversation as resolved.
Show resolved Hide resolved
"reason": "m.event_too_old",
V02460 marked this conversation as resolved.
Show resolved Hide resolved
"m.relates_to": {
"rel_type": "m.reference",
"event_id": "$some:event.id"
}
}
}
```

\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\
¹ Or similar – see *Security Considerations*

## Tradeoffs

Without this proposal, bridges could still inform users in a room that a
delivery failed by simply sending a plain message event from a bot account. This
possibility carries the disadvantage of conveying no special semantic meaning
with the consequence of clients not being able to adapt their presentation.

A fixed set of error types might be too restrictive to express every possible
condition. An alternative would be a free-form text for an error message. This
brings the problems of less semantic meaning and a requirement for
internationalization with it. In this proposal a generic error type is provided
for error cases not considered in this MSC.

## Potential issues

When the foreign network is not the cause of the error signaled but the bridge
itself (maybe under load), there might be an argument that responding to failed
messages increases the pressure.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another potential issue is that this doesn't convey any error information if messages failed to send due to the bridge being down completely (as the bridge is unable to send the error messages).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the bridge comes back online, it will receive the missed events from the HS, so they might be handled after all. This would be only temporary and by that explicitly not covered by this proposal. The big thing to tackle here would be a mechanism to signal delivery delays which would add to the core Matrix network as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think this is something that should be addressed or at least mentioned in the proposal. If the homeserver cannot send the event to the bridge, it should send an error event on its behalf (which the bridge can later redact).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

## Security considerations

Sending a custom regex with an event might open the doors for attacking a
homeserver and/or a client by exposing a direct pathway to the complex code of a
regex parser. Additionally sending arbitrary complex regexes might make Matrix
more vulnerable to DoS attacks. To mitigate these risks it might be sensible to
only allow a more restricted subset of regular expressions by e.g. requiring a
maximal length or falling back to simple globbing.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be worked out in this MSC, as we'll state in the spec whether a field supports full regex or only simple globbing.

@Half-Shot would a bridge ever need more than globbing for calling out affected users? Currently application service registration allows for full regex parsing (https://matrix.org/docs/spec/application_service/unstable#registration). But this is on the bridge side, and thus if it kills the homeserver, it was the homeserver operator that was at fault for using a bad registration file. Things are entirely different from the C-S API side.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If using globbing, there would be a conversion needed from the AS regex, which it should be based on. As the regex language is more powerful than globbing, some simplifications/hacks/heuristics are required there. Or having the bridge user add it in two different forms manually…


## Conclusion

In this document a new permanent event is proposed which a bridge can emit to
signal an error on its side. The event informs the affected room about which
message errored for which reason; it gives information about the affected users
and the bridged network. By implementing the proposal Matrix users will get more
insight into the state of their (un)delivered messages and thus they will become
less frustrated.