-
Notifications
You must be signed in to change notification settings - Fork 385
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MSC2162: Signaling Errors at Bridges #2162
base: old_master
Are you sure you want to change the base?
Conversation
Signed-off-by: Kai A. Hiller <[email protected]>
Is the somewhat dangerous |
Signed-off-by: Kai A. Hiller <[email protected]>
It depends on what is deemed necessary. You are right that it is the main objective to flag a message. That gives the user the information that something went wrong and adding the network name, the reason or the affected users adds more information to that. A key question is what helps a user and what does not. Adding this general event type helps definitively imo as it shows a previously invisible problem. The question now is how a user can act upon that and this should determine what is included and what not. Here is what came to my mind of what might be a user's reaction:
(Maybe an informed user is a goal in itself as well.) By leaving out the ¹ On a semantic level anyway. A user might be able to guess from the network name and the peoples nick name endings e.g. |
I agree that it's useful to be able to tell which users in a room are there via a bridge. It might make more sense for that to be handled separately from error reporting, e.g. as done by the widely deployed IRC flair. |
`m.bridge_error`. It is sent by the bridge and references an event previously | ||
sent in a room, by that marking it as “failed to deliver” for all users of a | ||
bridge. The new event type utilizes reference aggregations ([MSC | ||
1849](https://github.com/matrix-org/matrix-doc/blob/matthew/msc1849/proposals/1849-aggregations.md)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Heads up that @ara4n is rewriting this proposal, but I don't think it will affect this one https://github.com/matrix-org/matrix-doc/pull/2154/files
That might be enough. Just wanted to note that this is slightly ambiguous as there might be more than one IRC bridge in a room. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks reasonable so far. Before it can be accepted, details about the regex format would need to be finalized.
|
||
Additional information contained in the event are the name of the bridged | ||
network (e.g. “Discord” or “Telegram”) and a regex¹ describing the affected | ||
users (e.g. `@discord_*:example.org`). This regex should be similar to the one |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it's a regex, then it should be @discord_.*:example.org
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to jump in and say my preference is to keep it as a simple glob expression, rather than having to define the rules around regexes. I doubt we plan to do more than simple matching?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to define what kind of glob it is then, as we use two types in Matrix. Also if it's not a regex, don't call it a regex.
Being able to reuse the namespace from the registration is a strong argument for supporting proper regex here (which is better defined than globs).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(which is better defined than globs).
FWIW, my impression is the opposite. There are several different syntaxes for regexes. But personally, I don't really care which one is picked, as long as it's well defined.
When the foreign network is not the cause of the error signaled but the bridge | ||
itself (maybe under load), there might be an argument that responding to failed | ||
messages increases the pressure. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another potential issue is that this doesn't convey any error information if messages failed to send due to the bridge being down completely (as the bridge is unable to send the error messages).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When the bridge comes back online, it will receive the missed events from the HS, so they might be handled after all. This would be only temporary and by that explicitly not covered by this proposal. The big thing to tackle here would be a mechanism to signal delivery delays which would add to the core Matrix network as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still think this is something that should be addressed or at least mentioned in the proposal. If the homeserver cannot send the event to the bridge, it should send an error event on its behalf (which the bridge can later redact).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like it is mentioned here? https://github.com/matrix-org/matrix-doc/pull/2162/files#diff-3fc0af60441d4268c3ff475f9e03fb4cR192
A more formal/structured scheme for mapping users to bridges might be nice, yeah, but I really think it would be best done independently of error reporting, since it is otherwise useful and avoids a complicated and potentially hazardous feature here. Another problem that comes to mind is spoofing. How can we guarantee that reported errors are genuine? The most obvious answer is "configure power levels such that only the bridge bots or an admin can send the event," but that's not a strong guarantee, and risks misconfiguration. One solution might be to always identify bridges by their matrix-side bridge bot. Then errors sent by the bridge bot are inherently associated with the bridge in question, and things like human-readable network identifiers and the set of affected users can be determined robustly based on that information. |
This was my assumption about how it was supposed to work, and we should codify this in the proposal unless @V02460 has other ideas. I had a proposal a long time ago which mapped bot mxids <=> bridge metadata in the room state. |
In particular, in the presence of such a mechanism I think both the "network" and "affected_users" keys would be best replaced by indirect lookup through that same mechanism, because it provides a single necessarily consistent source of truth. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking very much in the right direction! It'd be good to also cover why EDUs, state events, or toDevice messages weren't picked as alternatives in the Tradeoffs section.
|
||
Additional information contained in the event are the name of the bridged | ||
network (e.g. “Discord” or “Telegram”) and a regex¹ describing the affected | ||
users (e.g. `@discord_*:example.org`). This regex should be similar to the one |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to define what kind of glob it is then, as we use two types in Matrix. Also if it's not a regex, don't call it a regex.
Being able to reuse the namespace from the registration is a strong argument for supporting proper regex here (which is better defined than globs).
by the bridge. It is used as a fallback when there is no other more specific | ||
reason. | ||
|
||
* `m.event_too_old` When the foreign network does not support timestamp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tbh as a bridge operator I'd rather drop the message on the floor than bother with signalling this error. I personally already run my bridges such that anything older than 15 minutes doesn't even get sent to the bridges to reduce traffic, useless processing, and context in the foreign network. Although there's the occasional question about why some messages don't get sent, I don't think it's worth spending the homeserver's time to advertise the failure.
Given these errors are associated with individual events, and time differences between events usually only happen after a major homeserver problem, this is just asking for a self-inflicted denial of service when the bridge is supposed to send thousands of errors into rooms. It's not very often that one or two events are too old - most of the time it's batches of hours worth of conversation.
tldr: I don't think we need this error code as it leads to self-inflicted denial of service.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm mixed on this, I absolutely agree with @turt2live that this definitely going to explode things though I do also see the benefit in signalling why a bunch of messages were never sent. The best thing I can come up with is a special event that defines the boundaries between a bridge stopping the processing of events, and beginning.
However, I don't think you can just specify a start and end point in terms of the dag and you'd need to list every missed event :/
When the foreign network is not the cause of the error signaled but the bridge | ||
itself (maybe under load), there might be an argument that responding to failed | ||
messages increases the pressure. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still think this is something that should be addressed or at least mentioned in the proposal. If the homeserver cannot send the event to the bridge, it should send an error event on its behalf (which the bridge can later redact).
enables bridges to communicate that something went wrong and gives clients the | ||
option to give feedback to their users. | ||
|
||
## Proposal |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The proposal does not cover how bridges de-flag errors (eventual success in sending a message). I am assuming they redact their original error event.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this proposal is only handling the case of when the bridge gives up trying to send.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that sounds sub-par tbh. We'd need a retry mechanism so that users aren't left stranded, or at the very least support redaction as a way to indicate clearing of the error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there is a good chance that a message will eventually be delivered, I don't think it belongs in this proposal. We should try to unify that case with a general “delivery delay notification” solution for the whole Matrix universe so the work has to be done only once. I am currently writing a bit about what I have in mind about those “delivery delay notifications” and there can discussion about that as well. (Also not quite sure where to have it then.) In the case of a message not being delivered with a high probability and just backing off in rare circumstances, redacting a permanent error might be adequate.
Until now I assumed the error is final and there is no retry, just a manual resend of the message. Could a bridge get the redaction and refetch the original event? Or might it be possible to simulate a resend with a no-op edit? If there is no satisfying way already, one could of course add another event type which is ignored by everyone but the bridge.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Homeservers are expected to keep retrying the appservice until it comes back alive, but that can easily be hours or even days before the service responds. Most bridges nowadays have a condition for received messages where ti just drops messages which are too old, but between the time the bridge went down and the time the message was ignored the user's message was not delivered without notification.
Limiting the scope of the proposal to just fatal errors doesn't really help with communicating the bridge's status because there's many more temporary failures that people expect to hear about due to the nature of realtime communications. It's bad enough we already get complaints when it takes more than 10 seconds to send a message through 4 different points of failure to the remote network.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am coming round to the idea of sending a temporary failure PDU for things which have failed to send and are in a retry queue of some kind. Redacting that would imply it's been sent.
Separately there is a question of if this proposal should cover how the user can indicate they want to retry a message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hopefully coming around enough to give the OK 😇
I'd be uncomfortable with this going into the spec if it only communicated permanent failures, because permanent failures are rare.
enables bridges to communicate that something went wrong and gives clients the | ||
option to give feedback to their users. | ||
|
||
## Proposal |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The proposal doesn't explain who sends the error event. Is it the sender_localpart
? a user in the namespace? Both of these options have consequences which make them non-ideal: the sender_localpart
might not be in the room in the case of a puppet bridge and a user in the namespace is an undefined thing: anyone can claim they are in a namespace.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The thing we want here is that “the bridge” does send the message, which is not a concept that maps straight to Matrix afaik. Instead we always need a proxy user for the bridge. There are two parts to get this right: Who is eligible to represent the bridge and how to make sure this info came from the bridge? This maps to the problems of authorization and authentication. @Half-Shot mentioned he had a proposal for this via the room state, so it might be a good idea to piggyback on that. (If it is a proposal I assume there is nothing else usable for us out there.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm failing to see the correlation between this and authorization for bridges (I also don't know what proposal that is). Bridges have a namespace of users and a sender_localpart
: which user is supposed to send it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One user representing the bridge does send the message. It depends on who is in the room, so the answer is both of them.
We can't simply say the bridge bot user as it is sometimes not joined e.g. in 1:1 conversations. Then the virtual user of your communication partner does represent the bridge and it should send the bridge error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This causes problems (as does the regex later on) because clients won't be able to do sanity checking on errors. They don't have a concept of bridges or appservices, and would be unable to see that @alice:example.org
doesn't have appropriate permissions over the Freenode IRC bridge.
enables bridges to communicate that something went wrong and gives clients the | ||
option to give feedback to their users. | ||
|
||
## Proposal |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The proposal should cover how much we care about random users impersonating bridges or bridges lying about their namespaces, and how we protect against that if we do care (we should).
enables bridges to communicate that something went wrong and gives clients the | ||
option to give feedback to their users. | ||
|
||
## Proposal |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am coming round to the idea of sending a temporary failure PDU for things which have failed to send and are in a retry queue of some kind. Redacting that would imply it's been sent.
Separately there is a question of if this proposal should cover how the user can indicate they want to retry a message.
by the bridge. It is used as a fallback when there is no other more specific | ||
reason. | ||
|
||
* `m.event_too_old` When the foreign network does not support timestamp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm mixed on this, I absolutely agree with @turt2live that this definitely going to explode things though I do also see the benefit in signalling why a bunch of messages were never sent. The best thing I can come up with is a special event that defines the boundaries between a bridge stopping the processing of events, and beginning.
However, I don't think you can just specify a start and end point in terms of the dag and you'd need to list every missed event :/
Signed-off-by: Kai A. Hiller <[email protected]>
Signed-off-by: Kai A. Hiller <[email protected]>
Signed-off-by: Kai A. Hiller <[email protected]>
Signed-off-by: Kai A. Hiller <[email protected]>
Signed-off-by: Kai A. Hiller <[email protected]>
Signed-off-by: Kai A. Hiller <[email protected]>
Signed-off-by: Kai A. Hiller <[email protected]>
The need for this proposal arises from a gap between the Matrix network and | ||
other foreign networks it bridges to. Matrix with its eventual consistency is | ||
unique in having a message delivery guarantee. Because of this property there is | ||
no need in the Matrix network itself to model the failure of message delivery. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd probably de-scope this for the sake of this MSC landing any time soon.
The need for this proposal arises from a gap between the Matrix network and | ||
other foreign networks it bridges to. Matrix with its eventual consistency is | ||
unique in having a message delivery guarantee. Because of this property there is | ||
no need in the Matrix network itself to model the failure of message delivery. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But if we are de-scoping, the MSC text needs to be updated to reflect that and the edge cases that, like @turt2live mentioned, can still be encountered.
When the foreign network is not the cause of the error signaled but the bridge | ||
itself (maybe under load), there might be an argument that responding to failed | ||
messages increases the pressure. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like it is mentioned here? https://github.com/matrix-org/matrix-doc/pull/2162/files#diff-3fc0af60441d4268c3ff475f9e03fb4cR192
regex parser. Additionally sending arbitrary complex regexes might make Matrix | ||
more vulnerable to DoS attacks. To mitigate these risks it might be sensible to | ||
only allow a more restricted subset of regular expressions by e.g. requiring a | ||
maximal length or falling back to simple globbing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be worked out in this MSC, as we'll state in the spec whether a field supports full regex or only simple globbing.
@Half-Shot would a bridge ever need more than globbing for calling out affected users? Currently application service registration allows for full regex parsing (https://matrix.org/docs/spec/application_service/unstable#registration). But this is on the bridge side, and thus if it kills the homeserver, it was the homeserver operator that was at fault for using a bad registration file. Things are entirely different from the C-S API side.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If using globbing, there would be a conversion needed from the AS regex, which it should be based on. As the regex language is more powerful than globbing, some simplifications/hacks/heuristics are required there. Or having the bridge user add it in two different forms manually…
Co-Authored-By: Andrew Morgan <[email protected]>
Signed-off-by: Kai A. Hiller <[email protected]>
Signed-off-by: Kai A. Hiller <[email protected]>
Signed-off-by: Kai A. Hiller <[email protected]>
Signed-off-by: Kai A. Hiller <[email protected]>
Signed-off-by: Kai A. Hiller <[email protected]>
Was there any interest in a more robust and rigorous approach to determining affected users, as previously discussed? |
Signed-off-by: Kai A. Hiller <[email protected]>
Signed-off-by: Kai A. Hiller <[email protected]>
I added notes about MSC 1410: Rich Bridging now, which provides exactly that. If it were already in The Spec, it would definitively be what should be used and this is the nicer way to approach it imo. The obvious hurdle here is that the MSC is not accepted yet and will probably need some more work. Currently I am ranking the two benefits we would get from it as moderately important, but want to hear if others agree with my assessment. A way forward could be to implement the current behavior and swap out the parts which rely on MSC 1410 when it is ready. I think adding the new changes would be backwards compatible, so they could be just tucked on later. |
between informing the user and preventing unnecessary spam. Throwing this | ||
error only for some subtypes of an event is fine. | ||
|
||
* `m.bridge_unavailable` The homeserver couldn't reach the bridge. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as a subclass to this which has shown to be problematic in recent weeks: When the homeserver is also dead, the users on other homeservers will see the message as delivered when in fact it is not.
I don't know if it makes total sense here given the traffic concern, but maybe flipping this proposal around for positive reactions to messages when they are delivered? Maybe a new kind of m.receipt
for this specific purpose.
or maybe we train the general public that the bridge sending a read receipt is fine?
Presumably these ideas have already been covered, so I'm curious as to what the decisions were that led to it not being used.
* `m.no_permission` The bridge wanted to handle an event, but didn't have the | ||
permission to do so. | ||
|
||
The bridge error can provide a `time_to_permanent` field. If this field is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or permanent: true
instead?
\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\ | ||
¹ Or similar – see [Security Considerations](#security-considerations) | ||
|
||
### Retries and error revocation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Retries might be better suited for a dedicated MSC given the complexity.
Utilizing the rights system of the room provides a good approximation to this | ||
behavior. It is fine to use it under the assumptions that | ||
|
||
- `m.bridge_error` and `m.bridge_error_revoke` require admin power levels. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
by default they don't, unless you are expecting this to go into a whole new room version (which is a much harder sell)
behavior. It is fine to use it under the assumptions that | ||
|
||
- `m.bridge_error` and `m.bridge_error_revoke` require admin power levels. | ||
- there is always the bridge bot user or a virtual user in the bridge's |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clients cannot check this
In short, this requires giving bridges admin power levels in a room and trusting | ||
them to restrict their actions to their own business. It is enough to have one | ||
privileged bridge user in the room. In public rooms this is most commonly the | ||
bridge bot user with admin power level available and in 1:1 conversations it is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More commonly the bridge does not have any kind of power in the room. When bridges are admins, they are often added through Scalar which makes this decision for them - the bridges themselves do not acquire power to operate. There's also plenty of bridges which are not represented in Scalar, which has lead to a majority of rooms not having appropriate permissions for all bridges.
There is no need for a new endpoint as the existing `/send` endpoint will be | ||
utilized. | ||
|
||
Additional information contained in the event are the name of the bridged |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can actually drop this now, when #2346 gets merged :)
Add a new room event for signaling permanent errors occurring at bridges. References MSC1849.
Rendered