Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Conversation

MadLittleMods
Copy link
Contributor

@MadLittleMods MadLittleMods commented Sep 15, 2022

Stop getting missing prev_events after we already know their signature is invalid. While #13635 stops us from doing the slow thing after we've already done it once, this PR stops us from doing one of the slow things in the first place.

Related to

Part of #13356

Follow-up to #13815 which tracks event signature failures.

With this PR, we avoid the call to the costly _get_state_ids_after_missing_prev_event because the signature failure will count as an attempt before and we filter events based on the backoff before calling _get_state_ids_after_missing_prev_event now.

For example, this will save us 156s out of the 185s total that this matrix.org /messages request. If you want to see the full Jaeger trace of this, you can drag and drop this trace.json into your own Jaeger, https://gist.github.com/MadLittleMods/4b12d0d0afe88c2f65ffcc907306b761

Simplified Jaeger trace of 185.9s /messages request

To explain this exact scenario around /messages -> backfill, we call /backfill and first check the signatures of the 100 events. We see bad signature for $luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M and $zuOn2Rd2vsC7SUia3Hp3r6JSkSFKcc5j3QTTqW_0jDw (both member events). Then we process the 98 events remaining that have valid signatures but one of the events references $luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M as a prev_event. So we have to do the whole _get_state_ids_after_missing_prev_event rigmarole which pulls in those same events which fail again because the signatures are still invalid.

  • backfill
    • outgoing-federation-request /backfill
    • _check_sigs_and_hash_and_fetch
      • _check_sigs_and_hash_and_fetch_one for each event received over backfill
        • $luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M fails with Signature on retrieved event was invalid.: unable to verify signature for sender domain xxx: 401: Failed to find any key to satisfy: _FetchKeyRequest(...)
        • $zuOn2Rd2vsC7SUia3Hp3r6JSkSFKcc5j3QTTqW_0jDw fails with Signature on retrieved event was invalid.: unable to verify signature for sender domain xxx: 401: Failed to find any key to satisfy: _FetchKeyRequest(...)
    • _process_pulled_events
      • _process_pulled_event for each validated event
        • ❗ Event $Q0iMdqtz3IJYfZQU2Xk2WjB5NDF8Gg8cFSYYyKQgKJ0 references $luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M as a prev_event which is missing so we try to get it
          • _get_state_ids_after_missing_prev_event
            • outgoing-federation-request /state_ids
            • get_pdu for $luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M which fails the signature check again
            • get_pdu for $zuOn2Rd2vsC7SUia3Hp3r6JSkSFKcc5j3QTTqW_0jDw which fails the signature check

Dev notes

SYNAPSE_POSTGRES=1 SYNAPSE_TEST_LOG_LEVEL=INFO python -m twisted.trial tests.handlers.test_federation_event.FederationEventHandlerTests.test_backfill_signature_failure_does_not_fetch_same_prev_event_later

Sytest:

source /Users/eric/Documents/github/element/synapse/.venv/bin/activate
./run-tests.pl --synapse-directory ../synapse tests/48admin.pl
docker run --rm -it -v ~/Documents/github/element/synapse:/src:ro -v ~/Documents/github/element/synapse/logs:/logs -v ~/Documents/github/element/sytest:/sytest:ro matrixdotorg/sytest-synapse:buster tests/48admin.pl
docker run --rm -it -e POSTGRES=1 -e WORKERS=1 \
    -v ~/Documents/github/element/synapse:/src:ro \
    -v ~/Documents/github/element/synapse/logs:/logs \
    -v ~/Documents/github/element/sytest:/sytest:ro matrixdotorg/sytest-synapse:buster tests/48admin.pl

Dump all events in the room with useful debug info

Adapted from matrix-org/matrix-react-sdk#8354 for txn (originally from #12370)

logger.info(
    "asdf_get_debug_events_in_room_ordered_by_depth\n%s",
    self.asdf_get_debug_events_in_room_ordered_by_depth_txn(txn, room_id),
)
asdf_get_debug_events_in_room_ordered_by_depth_txn
    def asdf_get_debug_events_in_room_ordered_by_depth_txn(
        self, txn: LoggingTransaction, room_id: str
    ) -> Any:
        """
        Output all events in the room in a nice table as a string to log

        Based on https://github.com/matrix-org/matrix-react-sdk/pull/8354
        """
        sql = (
            "SELECT depth, stream_ordering, type, state_key, outlier, event_id FROM events"
            " WHERE events.room_id = ?"
            " ORDER BY depth DESC, stream_ordering DESC;"
        )
        txn.execute(
            sql,
            (room_id,),
        )

        headerColumnLengthMap = {
            "depth": 4,
            "stream_ordering": 12,
            "type": 30,
            "state_key": 40,
            "outlier": 4,
            # event_ids are 44 long but we don't need the extra 5 padding
            # because it's the last item and we're just cheating by building
            # this into the value instead of the format code.
            "event_id": 39,
        }

        output = ""
        row_format = "".join(
            [
                "{:<" + str(columnLength + 5) + "}"
                for columnLength in headerColumnLengthMap.values()
            ]
        )
        output += row_format.format(*headerColumnLengthMap.keys()) + "\n"
        for row in txn:
            output += row_format.format(*[str(x) for x in row]) + "\n"
        return output

Todo

Pull Request Checklist

  • Pull request is based on the develop branch
  • Pull request includes a changelog file. The entry should:
    • Be a short description of your change which makes sense to users. "Fixed a bug that prevented receiving messages from other servers." instead of "Moved X method from EventStore to EventWorkerStore.".
    • Use markdown where necessary, mostly for code blocks.
    • End with either a period (.) or an exclamation mark (!).
    • Start with a capital letter.
    • Feel free to credit yourself, by adding a sentence "Contributed by @github_username." or "Contributed by [Your Name]." to the end of the entry.
  • Pull request includes a sign off
  • Code style is correct
    (run the linters)

…e is invalid

Related to
 - #13622
    - #13635
 - #13676

Follow-up to #13815
which tracks event signature failures.

This PR aims to stop us from trying to
`_get_state_ids_after_missing_prev_event` after
we already know that the prev_event will fail
from a previous attempt

To explain an exact scenario around `/messages` -> backfill, we call `/backfill` and first check the signatures of the 100 events. We see bad signature for `$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M` and `$zuOn2Rd2vsC7SUia3Hp3r6JSkSFKcc5j3QTTqW_0jDw` (both member events). Then we process the 98 events remaining that have valid signatures but one of the events references `$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M` as a `prev_event`. So we have to do the whole `_get_state_ids_after_missing_prev_event` rigmarole which pulls in those same events which fail again because the signatures are still invalid.

 - `backfill`
    - `outgoing-federation-request` `/backfill`
    - `_check_sigs_and_hash_and_fetch`
       - `_check_sigs_and_hash_and_fetch_one` for each event received over backfill
          - ❗ `$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M` fails with `Signature on retrieved event was invalid.`: `unable to verify signature for sender domain xxx: 401: Failed to find any key to satisfy: _FetchKeyRequest(...)`
          - ❗ `$zuOn2Rd2vsC7SUia3Hp3r6JSkSFKcc5j3QTTqW_0jDw` fails with `Signature on retrieved event was invalid.`: `unable to verify signature for sender domain xxx: 401: Failed to find any key to satisfy: _FetchKeyRequest(...)`
   - `_process_pulled_events`
      - `_process_pulled_event` for each validated event
         - ❗ Event `$Q0iMdqtz3IJYfZQU2Xk2WjB5NDF8Gg8cFSYYyKQgKJ0` references `$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M` as a `prev_event` which is missing so we try to get it
            - `_get_state_ids_after_missing_prev_event`
               - `outgoing-federation-request` `/state_ids`
               - ❗ `get_pdu` for `$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M` which fails the signature check again
               - ❗ `get_pdu` for `$zuOn2Rd2vsC7SUia3Hp3r6JSkSFKcc5j3QTTqW_0jDw` which fails the signature check

With this PR, we no longer call the burdensome `_get_state_ids_after_missing_prev_event`
because the signature failure will count as an attempt before we try to run this.
@MadLittleMods MadLittleMods added the T-Enhancement New features, changes in functionality, improvements in performance, or user-facing enhancements. label Sep 15, 2022
@MadLittleMods MadLittleMods added A-Messages-Endpoint /messages client API endpoint (`RoomMessageListRestServlet`) (which also triggers /backfill) A-Performance Performance, both client-facing and admin-facing labels Sep 15, 2022
See #13815 (comment)

So wes can avoid things like `CancelledError` which is a valid error
but probably should not count as an error
…nto madlittlemods/13622-13700-stop-getting-state-ids-after-missing-prev-event-if-we-tried-to-get-this-event-recently

Conflicts:
	synapse/storage/databases/main/event_federation.py
…nto madlittlemods/13622-13700-stop-getting-state-ids-after-missing-prev-event-if-we-tried-to-get-this-event-recently

Conflicts:
	synapse/storage/databases/main/event_federation.py
Base automatically changed from madlittlemods/13700-track-when-event-signature-fails to develop October 3, 2022 19:53
MadLittleMods added a commit that referenced this pull request Oct 3, 2022
Because we're doing the recording in `_check_sigs_and_hash_for_pulled_events_and_fetch` (previously named `_check_sigs_and_hash_and_fetch`), this means we will track signature failures for `backfill`, `get_room_state`, `get_event_auth`, and `get_missing_events` (all pulled event scenarios). And we also record signature failures from `get_pdu`.

Part of #13700

Part of #13676 and #13356

This PR will be especially important for #13816 so we can avoid the costly `_get_state_ids_after_missing_prev_event` down the line when `/messages` calls backfill.
…ate-ids-after-missing-prev-event-if-we-tried-to-get-this-event-recently
@MadLittleMods MadLittleMods marked this pull request as ready for review October 5, 2022 10:12
@erikjohnston erikjohnston self-assigned this Oct 12, 2022
Comment on lines 1576 to 1585
/**
* Exponential back-off (up to the upper bound) so we don't try to
* pull the same event over and over. ex. 2hr, 4hr, 8hr, 16hr, etc.
*
* We use `1 << n` as a power of 2 equivalent for compatibility
* with older SQLites. The left shift equivalent only works with
* powers of 2 because left shift is a binary operation (base-2).
* Otherwise, we would use `power(2, n)` or the power operator, `2^n`.
*/
AND ? /* current_time */ < last_attempt_ts + (
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It occurs to me that these inline SQL comments get sent over the wire to postgres, which is probably not totally ideal. Might be worth moving the big ones out?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, TBH I'm wondering whether for this it would be clearer to just fetch event_id, last_attempt_ts, num_attempts and do the checks in python?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, TBH I'm wondering whether for this it would be clearer to just fetch event_id, last_attempt_ts, num_attempts and do the checks in python?

I think this makes a lot of sense here since we have to get a response out for each event_ids anyway 👍



It occurs to me that these inline SQL comments get sent over the wire to postgres, which is probably not totally ideal. Might be worth moving the big ones out?

In terms of other places to worry about this: we already strip out the newlines which is why I use /* */ instead of inline --. Is there a way we can strip the comments out so we don't lose the nice readability? The performance of find/replacing this every time doesn't sound great but our _make_sql_one_line doesn't look particularly optimized by eye either 🤔. I'm not sure why we started doing that (introduced in #1939), maybe to get rid of all of the extra indentation spaces in multi-line strings.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I was thinking about that too, though we'd want to make sure that we do it safely (I'm not sure if just stripping /* ... */ is safe by itself).

Note that we don't actually send the one-line SQL to the database, we use it for logging.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created #14192 to continue this comments in SQL conversation ⏩

Note that we don't actually send the one-line SQL to the database, we use it for logging.

huh, I swore this was a problem before. Good to know!

synapse/handlers/federation_event.py Outdated Show resolved Hide resolved
synapse/handlers/federation_event.py Show resolved Hide resolved
@@ -274,6 +274,31 @@ def __init__(self, destination: Optional[str]):
)


class FederationPullAttemptBackoffError(SynapseError):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure it makes sense for this to inherit from SynapseError. The risk is that this manages to bubble all the way to the client API, returning a 403, which seems entirely wrong.

Copy link
Contributor Author

@MadLittleMods MadLittleMods Oct 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense 👍. The docstrings for SynapseError and FederationError could use some improving (would do in separate PR so we don't hang up on that).

We probably want to also update FederationDeniedError in a separate PR as well?

class FederationDeniedError(SynapseError):

Copy link
Contributor Author

@MadLittleMods MadLittleMods Oct 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created #14190 and #14191 to track these follow-up updates to explain those error classes better ⏩

synapse/handlers/federation_event.py Show resolved Hide resolved
Comment on lines +1573 to +1584
# Exponential back-off (up to the upper bound) so we don't try to
# pull the same event over and over. ex. 2hr, 4hr, 8hr, 16hr, etc.
if current_time
< event_failed_pull_attempt["last_attempt_ts"]
+ (
2
** min(
event_failed_pull_attempt["num_attempts"],
BACKFILL_EVENT_EXPONENTIAL_BACKOFF_MAXIMUM_DOUBLING_STEPS,
)
)
* BACKFILL_EVENT_EXPONENTIAL_BACKOFF_STEP_MILLISECONDS
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^ Sanity check appreciated here to make sure we're doing the equivalent of

/**
* Exponential back-off (up to the upper bound) so we don't retry the
* same backfill point over and over. ex. 2hr, 4hr, 8hr, 16hr, etc.
*
* We use `1 << n` as a power of 2 equivalent for compatibility
* with older SQLites. The left shift equivalent only works with
* powers of 2 because left shift is a binary operation (base-2).
* Otherwise, we would use `power(2, n)` or the power operator, `2^n`.
*/
AND (
failed_backfill_attempt_info.event_id IS NULL
OR ? /* current_time */ >= failed_backfill_attempt_info.last_attempt_ts + (
(1 << {least_function}(failed_backfill_attempt_info.num_attempts, ? /* max doubling steps */))
* ? /* step */
)
)

@erikjohnston erikjohnston changed the title Stop getting missing prev_events after we already know their signature is invalid Stop getting missing prev_events after we already know their signature is invalid Oct 14, 2022
synapse/handlers/federation.py Outdated Show resolved Hide resolved
synapse/handlers/federation_event.py Outdated Show resolved Hide resolved
synapse/handlers/federation_event.py Show resolved Hide resolved
MadLittleMods added a commit that referenced this pull request Oct 15, 2022
MadLittleMods added a commit that referenced this pull request Oct 15, 2022
@MadLittleMods MadLittleMods merged commit 40bb37e into develop Oct 15, 2022
@MadLittleMods MadLittleMods deleted the madlittlemods/13622-13700-stop-getting-state-ids-after-missing-prev-event-if-we-tried-to-get-this-event-recently branch October 15, 2022 05:36
@MadLittleMods
Copy link
Contributor Author

Thanks for the review @erikjohnston 🐫

MadLittleMods added a commit that referenced this pull request Oct 19, 2022
Explain `SynapseError` and `FederationError` better

Spawning from #13816 (comment)
netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this pull request Oct 29, 2022
Upstream changes:

Synapse 1.70.1 (2022-10-28)
===========================

(bugfixes)


Synapse 1.70.0 (2022-10-26)
===========================

Features
--------

- Support for
  [MSC3856](matrix-org/matrix-spec-proposals#3856):
  threads list
  API. ([\#13394](matrix-org/synapse#13394),
  [\#14171](matrix-org/synapse#14171),
  [\#14175](matrix-org/synapse#14175))

- Support for thread-specific notifications & receipts
  ([MSC3771](matrix-org/matrix-spec-proposals#3771)
  and
  [MSC3773](matrix-org/matrix-spec-proposals#3773)). ([\#13776](matrix-org/synapse#13776),
  [\#13824](matrix-org/synapse#13824),
  [\#13877](matrix-org/synapse#13877),
  [\#13878](matrix-org/synapse#13878),
  [\#14050](matrix-org/synapse#14050),
  [\#14140](matrix-org/synapse#14140),
  [\#14159](matrix-org/synapse#14159),
  [\#14163](matrix-org/synapse#14163),
  [\#14174](matrix-org/synapse#14174),
  [\#14222](matrix-org/synapse#14222))

- Stop fetching missing `prev_events` after we already know their
  signature is
  invalid. ([\#13816](matrix-org/synapse#13816))

- Send application service access tokens as a header (and query
  parameter). Implements
  [MSC2832](matrix-org/matrix-spec-proposals#2832). ([\#13996](matrix-org/synapse#13996))

- Ignore server ACL changes when generating pushes. Implements
  [MSC3786](matrix-org/matrix-spec-proposals#3786). ([\#13997](matrix-org/synapse#13997))

- Experimental support for redirecting to an implementation of a
  [MSC3886](matrix-org/matrix-spec-proposals#3886)
  HTTP rendezvous
  service. ([\#14018](matrix-org/synapse#14018))

- The `/relations` endpoint can now be used on
  workers. ([\#14028](matrix-org/synapse#14028))

- Advertise support for Matrix 1.3 and 1.4 on
  `/_matrix/client/versions`. ([\#14032](matrix-org/synapse#14032),
  [\#14184](matrix-org/synapse#14184))

- Improve validation of request bodies for the [Device
  Management](https://spec.matrix.org/v1.4/client-server-api/#device-management)
  and [MSC2697 Device
  Dehyrdation](matrix-org/matrix-spec-proposals#2697)
  client-server API
  endpoints. ([\#14054](matrix-org/synapse#14054))

- Experimental support for
  [MSC3874](matrix-org/matrix-spec-proposals#3874):
  Filtering threads from the `/messages`
  endpoint. ([\#14148](matrix-org/synapse#14148))

- Improve the validation of the following PUT endpoints:
  [`/directory/room/{roomAlias}`](https://spec.matrix.org/v1.4/client-server-api/#put_matrixclientv3directoryroomroomalias),
  [`/directory/list/room/{roomId}`](https://spec.matrix.org/v1.4/client-server-api/#put_matrixclientv3directorylistroomroomid)
  and
  [`/directory/list/appservice/{networkId}/{roomId}`](https://spec.matrix.org/v1.4/application-service-api/#put_matrixclientv3directorylistappservicenetworkidroomid). ([\#14179](matrix-org/synapse#14179))


Deprecations and Removals
-------------------------

- Remove the experimental implementation of
  [MSC3772](matrix-org/matrix-spec-proposals#3772). ([\#14094](matrix-org/synapse#14094))

- Remove the unstable identifier for
  [MSC3715](matrix-org/matrix-spec-proposals#3715). ([\#14106](matrix-org/synapse#14106),
  [\#14146](matrix-org/synapse#14146))
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A-Messages-Endpoint /messages client API endpoint (`RoomMessageListRestServlet`) (which also triggers /backfill) A-Performance Performance, both client-facing and admin-facing T-Enhancement New features, changes in functionality, improvements in performance, or user-facing enhancements.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants