matrix-org · V02460 · Jul 9, 2019 · Jul 10, 2019 · Aug 3, 2019 · Aug 3, 2019
diff --git a/proposals/2162-signaling-errors-at-bridges.md b/proposals/2162-signaling-errors-at-bridges.md
@@ -0,0 +1,312 @@
+# Signaling Errors at Bridges
+
+Sometimes bridges just silently swallow messages and other events. This proposal
+enables bridges to communicate that something went wrong and gives clients the
+option to give feedback to their users. Clients are given the possibility to
+retry a failed event and bridges can signal the success of the retry.
+
+## Proposal
+
+Bridges might come into a situation where there is nothing more they can do to
+successfully deliver an event to the foreign network they are connected to. Then
+they should be able to inform the originating room of the event about this
+delivery error. The user in turn should be able to instruct the bridge to retry
+sending the message that was presented him as failed; the bridge should have the
+ability to mark an error as being revoked.
+
+If [MSC 1410: Rich
+Bridging](https://github.com/matrix-org/matrix-doc/issues/1410) is utilized for
+this proposal it would additionally give the benefits of
+
+- trimming the number of properties required in each bridge error event by
+  separately providing these general infos about the bridge in the room state instead.
+- not requiring users representing the bridge to have admin power levels
+  (see [Rights management](#rights-management)).
+
+### Bridge error event
+
+This document proposes the addition of a new room event with type
+`m.bridge_error`. It is sent by the bridge and references an event previously
+sent in the same room, by that marking the original event as “failed to deliver”
+for all users of a bridge. The new event type utilizes reference aggregations
+([MSC
+1849](https://github.com/matrix-org/matrix-doc/blob/matthew/msc1849/proposals/1849-aggregations.md#relation-types))
+to establish the relation to the event its delivery it is marking as failed.
+There is no need for a new endpoint as the existing `/send` endpoint will be
+utilized.
+
+Additional information contained in the event are the name of the bridged
+network (e.g. “Discord” or “Telegram”) and a regex array¹ describing the
+affected users (e.g. `@discord_.*:example.org`). This regex array should be
+similar to the one any Application Service uses for marking its reserved user
+namespace. By providing this information clients can inform their users who in
+the room was affected by the error and for which network the error occurred.
+
+*Those two fields will not be required if the variant with [MSC 1410: Rich
+Bridging](https://github.com/matrix-org/matrix-doc/issues/1410) is adopted. In
+this case the same information is stored alongside other bridge metadata in the
+room state*
+
+There are some common reasons why an error occurred. These are encoded in the
+`reason` attribute and can contain the following types:
+
+* `m.event_not_handled` Generic error type for when an event can not be handled
+  by the bridge. It is used as a fallback when there is no other more specific
+  reason.
+
+* `m.event_too_old` A message will – with enough time passed – fall out of its
+  original context. In this case the bridge might decide that the event is too
+  old and emit this error.
+
+* `m.foreign_network_error` The bridge was doing its job fine, but the foreign
+  network permanently refused to handle the event.
+
+* `m.unknown_event` The bridge is not able to handle events of this type. It is
+  totally legitimate to “handle” an event by doing nothing and not throwing this
+  error. It is at the discretion of the bridge author to find a good balance
+  between informing the user and preventing unnecessary spam. Throwing this
+  error only for some subtypes of an event is fine.
+
+* `m.bridge_unavailable` The homeserver couldn't reach the bridge.
+
+* `m.no_permission` The bridge wanted to handle an event, but didn't have the
+  permission to do so.
+
+The bridge error can provide a `time_to_permanent` field. If this field is
+present it gives the time in milliseconds one has to wait before declaring the
+bridge error as permanent. As long as an error is younger than this time, the
+client can expect the possibility of the error being revoked. If a bridge error
+is permanent, it should not be revoked anymore. In case this field is missing,
+the error will never be considered permanent.
+
+Notes:
+
+- Nothing prevents multiple bridge error events to relate to the same event.
+  This should be pretty common as a room can be bridged to more than one network
+  at a time.
+
+- A bridge might choose to handle bridge error events, but this should never
+  result in emitting a new bridge error as this could lead to an endless
+  recursion.
+
+The need for this proposal arises from a gap between the Matrix network and
+other foreign networks it bridges to. Matrix with its eventual consistency is
+unique in having a message delivery guarantee. Because of this property there is
+no need in the Matrix network itself to model the failure of message delivery.
+This need only arises for interactions with foreign networks where message
+delivery might fail. This proposal extends Matrix to be aware of these error
+cases.
+
+Additionally there might be some operational restrictions of bridges which might
+make it necessary for them to refrain from handling an event, e.g. when hitting
+memory limits. In this case the new event type can be used as well.
+
+This is an example of how the new bridge error might look:
+
+```
+{
+    "type": "m.bridge_error",
+    "content": {
+        "network: "Discord",
+        "affected_users": ["@discord_.*:example.org"],
+        "reason": "m.bridge_unavailable",
+        "time_to_permanent": 900,
+        "m.relationship": {
+            "rel_type": "m.reference",
+            "event_id": "$some:event.id"
+        }
+    }
+}
+```
+
+\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\
+¹ Or similar – see [Security Considerations](#security-considerations)
+
+### Retries and error revocation
+
+Providing a way to retry a failed message delivery gives the sender control over
+the importance of her message. An extra procedure for a retry is necessary as
+the message might have been delivered to some users (those not on the bridge)
+and this would produce duplicate messages for them.
+
+A retry request is posted by the client to the room for all bridges to see it,
+referencing the original event. By inspecting the sender of all related
+`m.bridge_error` events, under all bridges the correct one can find out that it
+is responsible. The responsible bridge re-fetches the original event and retries
+to deliver it.
+
+A successful retry should be communicated by revoking (not redacting) the
+original error that made the retry necessary. Revocation is done by an event
+with the type `m.bridge_error_revoke` which references the original event. The
+error(s) having a sender of the same bridge as the revocation event are
+considered revoked. Clients can show a revocation message e.g. as “Delivered to
+Discord at 14:52.” besides the original event.
+
+On an unsuccessful retry the bridge may edit the error's content to reflect the
+new state, e.g. because the type of error changed or to communicate the new
+time.
+
+Example of the new retry events:
+
+```
+{
+    "type": "m.bridge_retry",
+    "content": {
+        "m.relationship": {
+            "rel_type": "m.reference",
+            "event_id": "$original:event.id"
+        }
+    }
+}
+```
+
+```
+{
+    "type": "m.bridge_error_revoke",
+    "content": {
+        "m.relationship": {
+            "rel_type": "m.reference",
+            "event_id": "$original:event.id"
+        }
+    }
+}
+```
+
+Overview of the relations between the different event types:
+
+```
+                           m.references
+ ________________     _____________________
+|                |   |                     |
+| Original Event |-+-|    Bridge Error     |
+|________________| | |_____________________|
+                   |  _____________________
+                   | |                     |
+                   +-|    Retry Request    |
+                   | |_____________________|
+                   |  _____________________
+                   | |                     |
+                   +-| Bridge Error Revoke |
+                     |_____________________|
+```
+
+A retry might not make much sense for every kind of error e.g. retrying
+`m.unknown_event` will probably result in the same error again. Clients may
+choose to disable retry options for those cases, but it is not restricted
+otherwise.
+
+### Special case: Unavailable bridge
+
+In the case the bridge is down or otherwise disconnected from the homeserver, it
+naturally has no way to inform its users about the unavailability. In this case
+the homeserver can stand in as an agent for the bridge and answer requests in
+its absence.
+
+For this to happen, the homeserver will send out a bridge error event in the
+moment a transaction delivery to the bridge failed. The clients at this point
+will start showing an error. When the bridge comes back online it will encounter
+a higher-than-normal load as all events accumulated over the downtime are
+flooding in. To handle this scenario well, the bridge will want to simply
+discard all messages older than a given threshold and not bother with sending
+any answer back.
+
+By including a timeout in the `time_to_permanent` field of the event, the client
+will know without further feedback from the homeserver or bridge when the
+message won't be delivered anymore.
+
+For those events still accepted by the bridge, the error must be revoked by a
+`m.bridge_error_revoke` as described in the previous chapter.
+
+**Note:** For this to work, the homeserver is required to impersonate a user of
+the bridge as it has no agent of its own. The impersonated user would be the
+bridge bot user or one of the virtual users in the bridge's namespace.
+
+### Rights management
+
+Only bridges should be allowed to send bridge errors and revocations.
+
+Utilizing the rights system of the room provides a good approximation to this
+behavior. It is fine to use it under the assumptions that
+
+- `m.bridge_error` and `m.bridge_error_revoke` require admin power levels.
+- there is always the bridge bot user or a virtual user in the bridge's
+  namespace present in the room.
+- at least one of those users possesses admin power level.
+- all users with admin power levels are trusted.
+
+In short, this requires giving bridges admin power levels in a room and trusting
+them to restrict their actions to their own business. It is enough to have one
+privileged bridge user in the room. In public rooms this is most commonly the
+bridge bot user with admin power level available and in 1:1 conversations it is
+the puppeted conversation partner which does generally have admin power levels
+as well.
+
+As long as the above assumptions are met, it is fine to not explicitly denote
+bridges and bridge users as such and simply rely on the power levels for access
+control to the new events.
+
+An alternative for the above solution is the adoption of [MSC 1410: Rich
+Bridging](https://github.com/matrix-org/matrix-doc/issues/1410). It stores
+information about users affiliation to a bridge in the room state. Instead of
+checking power levels of users, rich bridging can be utilized by checking the
+room state and only allow valid representatives of the bridge to send bridge
+errors and their revocations. This alternative has the advantage of not
+requiring agents of the bridge to be powerful. They would be verifiable and
+could be trusted without any restrictions regarding their power levels.
+
+## Tradeoffs
+
+Without this proposal, bridges could still inform users in a room that a
+delivery failed by simply sending a plain message event from a bot account. This
+possibility carries the disadvantage of conveying no special semantic meaning
+with the consequence of clients not being able to adapt their presentation.
+
+A fixed set of error types might be too restrictive to express every possible
+condition. An alternative would be a free-form text for an error message. This
+brings the problems of less semantic meaning and a requirement for
+internationalization with it. In this proposal a generic error type is provided
+for error cases not considered in this MSC.
+
+The nature of a retry request from a client to the bridge lends it more to an
+ephemeral type of transport than something permanent like a PDU, but it was
+advised against it for The Spec doesn't make implementations of new EDU types
+easy. Applications Services in general don't allow listening to EDUs, so further
+changes to The Spec would be necessary before following the probably more
+appropriate route here.
+
+A new event type `m.bridge_error_revoke` is introduced for revoking a bridge
+error. Alternatively it could be considered to redact the bridge error event,
+which would eliminate the need for the revocation event and would make this
+proposal a little simpler. The disadvantage of this approach is the missing
+transparency and context of who had which information at which point in time.
+This additional information should make for a better user experience.
+
+## Potential issues
+
+When the foreign network is not the cause of the error signaled but the bridge
+itself (maybe under load), there might be an argument that responding to failed
+messages increases the pressure.
+
+## Security considerations
+
+Sending a custom regex with an event might open the doors for attacking a
+homeserver and/or a client by exposing a direct pathway to the complex code of a
+regex parser. Additionally sending arbitrary complex regexes might make Matrix
+more vulnerable to DoS attacks. To mitigate these risks it might be sensible to
+only allow a more restricted subset of regular expressions by e.g. requiring a
+maximal length or falling back to simple globbing.
+
+When utilizing power levels instead of building on [MSC 1410: Rich
+Bridging](https://github.com/matrix-org/matrix-doc/issues/1410) a malicious user
+who has enough power to send `m.bridge_error` or `m.bridge_error_revoke` is able
+to impersonate a bridge. She will be able to wrongly mark messages as failed to
+deliver or revoke errors when they were not successfully retried.
+
+## Conclusion
+
+In this document an event is proposed for bridges to signal errors and a way to
+retry and revoke those errors. The event informs the affected room about which
+message errored for which reason; it gives information about the affected users
+and the bridged network. By implementing the proposal Matrix users will get more
+insight into the state of their (un)delivered messages and thus they will become
+less frustrated.