Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix up logic for delaying sending read receipts over federation. #17933

Merged
merged 3 commits into from
Nov 25, 2024

Conversation

erikjohnston
Copy link
Member

@erikjohnston erikjohnston commented Nov 14, 2024

For context of why we delay read receipts, see matrix-org/synapse#4730.

Element Web often sends read receipts in quick succession, if it reloads the timeline it'll send one for the last message in the old timeline and again for the last message in the new timeline. This caused remote users to see a read receipt for older messages come through quickly, but then the second read receipt taking a while to arrive for the most recent message.

There are two things going on in this PR:

  1. There was a mismatch between seconds and milliseconds, and so we ended up delaying for far longer than intended.
  2. Changing the logic to reuse the DestinationWakeupQueue (used for presence)

The changes in logic are:

  • Treat the first receipt and subsequent receipts in a room in the same way
  • Whitelist certain classes of receipts to never delay being sent, i.e. receipts in small rooms, receipts for events that were sent within the last 60s, and sending receipts to the event sender's server.
  • The maximum delay a receipt can have before being sent to a server is 30s, and we'll send out receipts to remotes at least at 50Hz (by default)

The upshot is that this should make receipts feel more snappy over federation.

This new logic should send roughly between 10%–20% of transactions immediately on matrix.org.

SQL query and output

Looking at the last 100,000 local receipts, we look at the number of servers we need to send those receipts to and bucket them into different classes. First, if they are in a small room (i.e. less than 5 domains), and then by age of event.

     delay     |  sum   
---------------+--------
 00 small room |   2688
 01 <1s        |   1476
 05 <5s        |   4967
 10 <10s       |   1643
 60 <60s       |   9052
 old           | 129735
SELECT
  delay,
  SUM (to_send) 
FROM
  (
    SELECT
      CASE
        WHEN
          domains - 1 < 5 
        THEN
          '00 small room' 
        WHEN
          read_ts - received_ts < '1s' 
        THEN
          '01 <1s' 
        WHEN
          read_ts - received_ts < '5s' 
        THEN
          '05 <5s' 
        WHEN
          read_ts - received_ts < '10s' 
        THEN
          '10 <10s' 
        WHEN
          read_ts - received_ts < '60s' 
        THEN
          '60 <60s' 
        ELSE
          'old' 
      END
      AS delay, domains - 1 AS to_send 
    FROM
      (
        SELECT
          r.room_id,
          event_id,
          to_timestamp((data::json ->> 'ts')::BIGINT / 1000) AS read_ts,
          to_timestamp(received_ts / 1000) AS received_ts 
        FROM
          receipts_linearized AS r 
          INNER JOIN
            events USING (event_id) 
        WHERE
          r.user_id LIKE '%:matrix.org' 
        ORDER BY
          stream_id DESC LIMIT 10000
      )
      AS r2 
      INNER JOIN
        lateral (
        SELECT
          COUNT(DISTINCT SUBSTRING(state_key FROM '@[^:]*:(.*)$')) AS domains 
        FROM
          current_state_events 
        WHERE
          type = 'm.room.member' 
          AND room_id = r2.room_id 
          AND membership = 'join') AS c 
          ON TRUE 
        WHERE
          domains > 1
  )
  AS a 
GROUP BY
  delay 
ORDER BY
  delay;

@erikjohnston erikjohnston force-pushed the erikj/delayed_receipts branch from 0bb3b58 to 801983c Compare November 14, 2024 18:14
@erikjohnston erikjohnston marked this pull request as ready for review November 14, 2024 19:48
@erikjohnston erikjohnston requested a review from a team as a code owner November 14, 2024 19:48
Copy link
Contributor

@reivilibre reivilibre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems to make sense

# destination is registered for the flush
if queues_pending_flush is not None:
queues_pending_flush.add(queue)
if self.clock.time_msec() - metadata.received_ts < 60_000:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know it's a silly case, but maybe

Suggested change
if self.clock.time_msec() - metadata.received_ts < 60_000:
if 0 <= self.clock.time_msec() - metadata.received_ts < 60_000:

to guard against the case where someone's clock was wrong (in the future) for a while, or is currently in the past, from causing them to treat all events as recent?

Or perhaps it should have some tolerance for workers being out of sync, so perhaps -60_000 <=???...

self.reactor.advance(10)
mock_send_transaction.assert_called_once()
json_cb = mock_send_transaction.call_args[0][1]
transaction = mock_send_transaction.call_args_list[0][0][0]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
transaction = mock_send_transaction.call_args_list[0][0][0]
transaction = mock_send_transaction.call_args_list[0].args[0]

I think this is correct and clearer? (ditto for other assertions)


# expect a call to send_transaction for each host
self.assertEqual(mock_send_transaction.call_count, 20)
self._assert_edu_in_call(mock_send_transaction.call_args[0][1])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(also for these

Suggested change
self._assert_edu_in_call(mock_send_transaction.call_args[0][1])
self._assert_edu_in_call(mock_send_transaction.call_args.args[1])

)

@erikjohnston erikjohnston enabled auto-merge (squash) November 25, 2024 17:45
@erikjohnston erikjohnston merged commit 3943d2f into develop Nov 25, 2024
39 checks passed
@erikjohnston erikjohnston deleted the erikj/delayed_receipts branch November 25, 2024 18:12
MatMaul pushed a commit to MatMaul/synapse that referenced this pull request Dec 4, 2024
…ment-hq#17933)

For context of why we delay read receipts, see
matrix-org/synapse#4730.

Element Web often sends read receipts in quick succession, if it reloads
the timeline it'll send one for the last message in the old timeline and
again for the last message in the new timeline. This caused remote users
to see a read receipt for older messages come through quickly, but then
the second read receipt taking a while to arrive for the most recent
message.

There are two things going on in this PR:
1. There was a mismatch between seconds and milliseconds, and so we
ended up delaying for far longer than intended.
2. Changing the logic to reuse the `DestinationWakeupQueue` (used for
presence)

The changes in logic are:
- Treat the first receipt and subsequent receipts in a room in the same
way
- Whitelist certain classes of receipts to never delay being sent, i.e.
receipts in small rooms, receipts for events that were sent within the
last 60s, and sending receipts to the event sender's server.
- The maximum delay a receipt can have before being sent to a server is
30s, and we'll send out receipts to remotes at least at 50Hz (by
default)

The upshot is that this should make receipts feel more snappy over
federation.

This new logic should send roughly between 10%–20% of transactions
immediately on matrix.org.
yingziwu added a commit to yingziwu/synapse that referenced this pull request Dec 20, 2024
This release contains the security fixes from [v1.120.2](https://github.com/element-hq/synapse/releases/tag/v1.120.2).

- Fix release process to not create duplicate releases. ([\#18025](element-hq/synapse#18025))

- Support for [MSC4190](matrix-org/matrix-spec-proposals#4190): device management for Application Services. ([\#17705](element-hq/synapse#17705))
- Update [MSC4186](matrix-org/matrix-spec-proposals#4186) Sliding Sync to include invite, ban, kick, targets when `$LAZY`-loading room members. ([\#17947](element-hq/synapse#17947))
- Use stable `M_USER_LOCKED` error code for locked accounts, as per [Matrix 1.12](https://spec.matrix.org/v1.12/client-server-api/#account-locking). ([\#17965](element-hq/synapse#17965))
- [MSC4076](matrix-org/matrix-spec-proposals#4076): Add `disable_badge_count` to pusher configuration. ([\#17975](element-hq/synapse#17975))

- Fix long-standing bug where read receipts could get overly delayed being sent over federation. ([\#17933](element-hq/synapse#17933))

- Add OIDC example configuration for Forgejo (fork of Gitea). ([\#17872](element-hq/synapse#17872))
- Link to element-docker-demo from contrib/docker*. ([\#17953](element-hq/synapse#17953))

- [MSC4108](matrix-org/matrix-spec-proposals#4108): Add a `Content-Type` header on the `PUT` response to work around a faulty behavior in some caching reverse proxies. ([\#17253](element-hq/synapse#17253))
- Fix incorrect comment in new schema delta. ([\#17936](element-hq/synapse#17936))
- Raise setuptools_rust version cap to 1.10.2. ([\#17944](element-hq/synapse#17944))
- Enable encrypted appservice related experimental features in the complement docker image. ([\#17945](element-hq/synapse#17945))
- Return whether the user is suspended when querying the user account in the Admin API. ([\#17952](element-hq/synapse#17952))
- Fix new scheduled tasks jumping the queue. ([\#17962](element-hq/synapse#17962))
- Bump pyo3 and dependencies to v0.23.2. ([\#17966](element-hq/synapse#17966))
- Update setuptools-rust and fix building abi3 wheels in latest version. ([\#17969](element-hq/synapse#17969))
- Consolidate SSO redirects through `/_matrix/client/v3/login/sso/redirect(/{idpId})`. ([\#17972](element-hq/synapse#17972))
- Fix Docker and Complement config to be able to use `public_baseurl`. ([\#17986](element-hq/synapse#17986))
- Fix building wheels for MacOS which was temporarily disabled in Synapse 1.120.2. ([\#17993](element-hq/synapse#17993))
- Fix release process to not create duplicate releases. ([\#17970](element-hq/synapse#17970), [\#17995](element-hq/synapse#17995))

* Bump bytes from 1.8.0 to 1.9.0. ([\#17982](element-hq/synapse#17982))
* Bump pysaml2 from 7.3.1 to 7.5.0. ([\#17978](element-hq/synapse#17978))
* Bump serde_json from 1.0.132 to 1.0.133. ([\#17939](element-hq/synapse#17939))
* Bump tomli from 2.0.2 to 2.1.0. ([\#17959](element-hq/synapse#17959))
* Bump tomli from 2.1.0 to 2.2.1. ([\#17979](element-hq/synapse#17979))
* Bump tornado from 6.4.1 to 6.4.2. ([\#17955](element-hq/synapse#17955))
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants