Backoffs and 429s on /keys/claim over federation causes UTDs #17267

kegsay · 2024-06-05T13:14:15Z

Description

Debugging a UTD (rageshake) and the cause of this appears to be /keys/claim failing with:

failures={"connecteu.rs": Object {"message": String("Not ready for retry"), "status": Number(503)}, "sw1v.org": Object {"message": String("Not ready for retry"), "status": Number(503)}}

This happens again 40 minutes later, which feels very wrong if the retry period is 40mins+.

A long retry period like this will cause UTDs because the sender cannot claim the OTK for one or more of the device's recipients.

The error message originates here which is called from here for claiming keys. This calls through to the transport layer which does post_json which shows it can throw NotRetryingDestination.

It seems to be thrown in get_retry_limiter here. The retry interval controls the duration, which seems to be persisted. This is loaded here and is modified according to:

                self.retry_interval = int(
                    self.retry_interval
                    * self.destination_retry_multiplier
                    * random.uniform(0.8, 1.4)
                )

The retry multiplier is a configurable value and retry interval defaults to the min: self.retry_interval = self.destination_min_retry_interval_ms. So what's matrix.org's config?

Steps to reproduce

I'm guessing:

make your server go offline for a bit
come back online
have it not retry for ages?

Homeserver

matrix.org

Synapse Version

Whatever was running on May 30

Installation Method

I don't know

Database

postgres

Workers

Multiple workers

Platform

?

Configuration

No response

Relevant log output

client-side:

`2024-05-30 09:29:57.003 Element[7064:2329509] [MXCryptoSDK] DEBUG receive_keys_claim_response ... failures={"connecteu.rs": Object {"message": String("Not ready for retry"), "status": Number(503)}, "sw1v.org": Object {"message": String("Not ready for retry"), "status": Number(503)}}`

Anything else that would be useful to know?

Proposed solution here would be to ignore the backoff for /keys/claim requests, as if they fail it will definitely cause a UTD. If we don't want to do that, having a suitably low retry period (capped in the order of minutes) could be a viable alternative.

Alternatively, I had assumed that Synapse cleared backoffs when the other HS sent something to matrix.org..? Surely this would have happened here?

The text was updated successfully, but these errors were encountered:

richvdh · 2024-06-05T14:21:44Z

I think this is a duplicate of #8917

richvdh · 2024-06-06T13:36:47Z

Also: it might be better described as "backoff period" than "retry period", since synapse doesn't itself initiate any retries for /keys/claim ?

kegsay · 2024-06-07T08:34:12Z

I've seen this happen again but this time between element.io <--> matrix.org, where the client on element.io saw in response to /keys/claim: "matrix.org": Object {"message": String("Failed to send request: HttpResponseException: 429: Too Many Requests"), "status": Number(503.0)}

Rate limiting this endpoint feels suboptimal...

kegsay · 2024-06-07T10:56:34Z

matrix-org/matrix-spec-proposals#4081 would be a solution to this because then the sending server could just serve up the fallback key if it cannot talk to the recipient server.

ara4n · 2024-06-07T17:52:06Z

@kegsay pretty sure this isn't a 429 but really a 503 (see the status field), which matrix.org's haproxy turns into a 429 because if a single endpoint reports 503 to CF then CF will mark the whole haproxy as down, taking out the whole service. It's a horrible hack, given it's absolutely nothing to do with rate limiting; i suspect 429 was chosen as a code which would encourage retries without the backend being marked as down.

andybalaam · 2024-06-10T13:53:39Z

Crypto are not actively working on this because the best solution would be to do matrix-org/matrix-spec-proposals#4081

kegsay · 2024-07-11T15:20:59Z

Another one, element.io <-> matrix.org failing with a different error: failures={"matrix.org": Object {"status": Number(503), "message": String("Failed to send request: TimeoutError: Timed out after 10s")}}

kegsay · 2024-07-12T09:12:28Z

This happened again in a large E2EE room. The failure mode was subtly different though because /keys/claim did eventually succeed for a 2nd+ message, so the error message was The message was encrypted using an unknown message index, first known index 1, index of the message 0 rather than not having the keys at all.

richvdh · 2024-10-17T14:14:19Z

It's unclear to me how this is different from element-hq/element-meta#2154, which covers the implementation of MSC4081.

Closing for now, unless someone can clarify

kegsay added A-E2EE T-Defect z-uisi labels Jun 5, 2024

kegsay changed the title ~~Retry period for /keys/claim over federation is too long~~ Backoff and 429s on /keys/claim over federation causes UTDs Jun 7, 2024

kegsay changed the title ~~Backoff and 429s on /keys/claim over federation causes UTDs~~ Backoffs and 429s on /keys/claim over federation causes UTDs Jun 7, 2024

kegsay mentioned this issue Jun 7, 2024

MSC4081: Eagerly sharing fallback keys with federated servers matrix-org/matrix-spec-proposals#4081

Open

anoadragon453 added the O-Occasional label Jun 7, 2024

MarcWadai mentioned this issue Jul 9, 2024

Can we have more info about why messages are undecrypted ? tchapgouv/tchap-web-v4#904

Open

richvdh closed this as completed Oct 17, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Backoffs and 429s on /keys/claim over federation causes UTDs #17267

Backoffs and 429s on /keys/claim over federation causes UTDs #17267

kegsay commented Jun 5, 2024 •

edited

Loading

richvdh commented Jun 5, 2024

richvdh commented Jun 6, 2024

kegsay commented Jun 7, 2024

kegsay commented Jun 7, 2024 •

edited

Loading

ara4n commented Jun 7, 2024

andybalaam commented Jun 10, 2024

kegsay commented Jul 11, 2024

kegsay commented Jul 12, 2024

richvdh commented Oct 17, 2024

Backoffs and 429s on /keys/claim over federation causes UTDs #17267

Backoffs and 429s on /keys/claim over federation causes UTDs #17267

Comments

kegsay commented Jun 5, 2024 • edited Loading

Description

Steps to reproduce

Homeserver

Synapse Version

Installation Method

Database

Workers

Platform

Configuration

Relevant log output

Anything else that would be useful to know?

richvdh commented Jun 5, 2024

richvdh commented Jun 6, 2024

kegsay commented Jun 7, 2024

kegsay commented Jun 7, 2024 • edited Loading

ara4n commented Jun 7, 2024

andybalaam commented Jun 10, 2024

kegsay commented Jul 11, 2024

kegsay commented Jul 12, 2024

richvdh commented Oct 17, 2024

kegsay commented Jun 5, 2024 •

edited

Loading

kegsay commented Jun 7, 2024 •

edited

Loading