Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

deal with an event that is in event_json but not events #10718

Closed
JanZerebecki opened this issue Aug 30, 2021 · 16 comments
Closed

deal with an event that is in event_json but not events #10718

JanZerebecki opened this issue Aug 30, 2021 · 16 comments
Labels
P3 (OBSOLETE: use S- labels.) Approved backlog: not yet scheduled, will accept patches T-Task Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks.

Comments

@JanZerebecki
Copy link

Description

When an event comes in that is not in events, when trying to persist it in event_json a possible duplicate should be handled. Currently

self.db_pool.simple_insert_many_txn(
txn,
table="event_json",
fails (see exception below). I do not have enough domain knowledge to say whether to leave the existing entry unchanged or to update the existing entry. I assume the better choice is to replace it as then the data comes from the same input. However that might be a vector for maliciously overwriting data and breaking history integrity.

The same issue was also seen in #8889 but not further discussed (besides the manually deleting the event from event_json).

Aug 27 11:43:51 matrix synapse[27658]: synapse.handlers.federation: [_process_incoming_pdus_in_room_inner-13-$REDACTED1] Error attempting to resolve state at missing prev_events
                                       Traceback (most recent call last):
                                         File "/usr/lib/python3.6/site-packages/synapse/handlers/federation.py", line 387, in on_receive_pdu
                                           origin, room_id, p
                                         File "/usr/lib/python3.6/site-packages/synapse/handlers/federation.py", line 732, in _get_state_after_missing_prev_event
                                           destination=destination, room_id=room_id, events=missing_events
                                         File "/usr/lib/python3.6/site-packages/synapse/handlers/federation.py", line 1379, in _get_events_and_persist
                                           event_infos,
                                         File "/usr/lib/python3.6/site-packages/synapse/handlers/federation.py", line 2323, in _auth_and_persist_events
                                           backfilled=backfilled,
                                         File "/usr/lib/python3.6/site-packages/synapse/handlers/federation.py", line 3320, in persist_events_and_notify
                                           event_and_contexts, backfilled=backfilled
                                       File "/usr/lib/python3.6/site-packages/synapse/storage/persist_events.py", line 322, in persist_events
                                           ret_vals = await yieldable_gather_results(enqueue, partitioned.items())
                                         File "/usr/lib/python3.6/site-packages/synapse/storage/persist_events.py", line 240, in handle_queue_loop
                                           item.events_and_contexts, item.backfilled
                                         File "/usr/lib/python3.6/site-packages/synapse/storage/persist_events.py", line 582, in _persist_event_batch
                                           backfilled=backfilled,
                                         File "/usr/lib/python3.6/site-packages/synapse/storage/databases/main/events.py", line 180, in _persist_events_and_state_updates
                                           new_forward_extremeties=new_forward_extremeties,
                                         File "/usr/lib/python3.6/site-packages/synapse/storage/database.py", line 694, in runInteraction
                                           **kwargs,
                                         File "/usr/lib/python3.6/site-packages/synapse/storage/database.py", line 792, in runWithConnection
                                           self._db_pool.runWithConnection(inner_func, *args, **kwargs)
                                         File "/usr/lib/python3.6/site-packages/twisted/python/threadpool.py", line 238, in inContext
                                           result = inContext.theWork()  # type: ignore[attr-defined]
                                         File "/usr/lib/python3.6/site-packages/twisted/python/threadpool.py", line 255, in <lambda>
                                           ctx, func, *args, **kw
                                         File "/usr/lib/python3.6/site-packages/twisted/python/context.py", line 118, in callWithContext
                                           return self.currentContext().callWithContext(ctx, func, *args, **kw)
                                         File "/usr/lib/python3.6/site-packages/twisted/python/context.py", line 83, in callWithContext
                                           return func(*args, **kw)
                                         File "/usr/lib/python3.6/site-packages/twisted/enterprise/adbapi.py", line 293, in _runWithConnection
                                           compat.reraise(excValue, excTraceback)
                                         File "/usr/lib/python3.6/site-packages/twisted/python/deprecate.py", line 298, in deprecatedFunction
                                           return function(*args, **kwargs)
                                         File "/usr/lib/python3.6/site-packages/twisted/python/compat.py", line 404, in reraise
                                           raise exception.with_traceback(traceback)
                                         File "/usr/lib/python3.6/site-packages/twisted/enterprise/adbapi.py", line 284, in _runWithConnection
                                           result = func(conn, *args, **kw)
                                         File "/usr/lib/python3.6/site-packages/synapse/storage/database.py", line 786, in inner_func
                                           return func(db_conn, *args, **kwargs)
                                         File "/usr/lib/python3.6/site-packages/synapse/storage/database.py", line 554, in new_transaction
                                           r = func(cursor, *args, **kwargs)
                                         File "/usr/lib/python3.6/site-packages/synapse/logging/utils.py", line 69, in wrapped
                                           return f(*args, **kwargs)
                                         File "/usr/lib/python3.6/site-packages/synapse/storage/databases/main/events.py", line 378, in _persist_events_txn
                                           self._store_event_txn(txn, events_and_contexts=events_and_contexts)
                                         File "/usr/lib/python3.6/site-packages/synapse/storage/databases/main/events.py", line 1350, in _store_event_txn
                                           for event, _ in events_and_contexts
                                         File "/usr/lib/python3.6/site-packages/synapse/storage/database.py", line 929, in simple_insert_many_txn
                                           txn.execute_batch(sql, vals)
                                         File "/usr/lib/python3.6/site-packages/synapse/storage/database.py", line 279, in execute_batch
                                           self._do_execute(lambda *x: execute_batch(self.txn, *x), sql, args)
                                         File "/usr/lib/python3.6/site-packages/synapse/storage/database.py", line 331, in _do_execute
                                           return func(sql, *args)
                                         File "/usr/lib/python3.6/site-packages/synapse/storage/database.py", line 279, in <lambda>
                                           self._do_execute(lambda *x: execute_batch(self.txn, *x), sql, args)
                                         File "/usr/lib64/python3.6/site-packages/psycopg2/extras.py", line 1209, in execute_batch
                                           cur.execute(b";".join(sqls))
                                       psycopg2.errors.UniqueViolation: duplicate key value violates unique constraint "event_json_event_id_key"
                                       DETAIL:  Key (event_id)=($Redacted2) already exists.

Steps to reproduce

  • have an event come in that is in event_json but not events

Version information

  • Homeserver: matrix.opensuse.org

If not matrix.org:

  • Version: 1.41.0

  • Install method:

@reivilibre
Copy link
Contributor

(As far as I'm aware,) It's not meant to be the case that you can have an event in events_json but not events (I guess we would put a foreign key constraint on, but that might be bad for performance). Indeed, insertions to both happen in the same transaction.

I'm therefore led to believe that something weird has gone on with your database (do you know if you've done anything unusual that could have led to this? Restored from an old backup, ... anything?).

However, as a point of robustness, it might be sensible to implement something like your PR.

I will raise it for discussion.

@JanZerebecki
Copy link
Author

Yes, there was an DB upgrade that ran into problems, need to check how that was solved. Though that is not an obvious explanation. How likely is it that any of the tools involved pick up half of a transaction? Maybe the transaction is committed by some other mechanism than explicitly by the code doing the insert in certain situations?

JanZerebecki added a commit to JanZerebecki/synapse that referenced this issue Sep 1, 2021
When an incoming event id is present in event_json but not in events
synapse fails trying to insert it with "psycopg2.errors.UniqueViolation:
duplicate key value violates unique constraints", because it is only
filtered based on those that are in events.

I don't know why those become out of sync, but this happening was
reported by others before.

Fix this by using an upsert (which inserts or updates existing records)
instead of a normal insert.

Please verify that this is the safe and correct thing to do before
merging this. Verify e.g. that it doesn't allow breaking history
integrity or something like it. As I don't know enough to understand
what this change entails.

Fixes: matrix-org#10718
Signed-off-by: Jan Zerebecki <[email protected]>
@JanZerebecki
Copy link
Author

After the failed postgresql upgrade https://progress.opensuse.org/issues/93686 it seems there was some manual editing of the database: https://progress.opensuse.org/issues/94189 .
That reads as if these manual edits were the cause and the lack of a constraint made it so nobody noticed until investigating why there was a lag of a few days for messages incoming over federation.

@callahad
Copy link
Contributor

callahad commented Sep 2, 2021

Thank you for proposing a pull request (#10719); we'll discuss that there. However, given that manual database edits were involved in reaching this state, it's not immediately clear that the solution in your PR would leave you in a correct -- rather than just differently broken -- state.

Adding more constraints to the database would be helpful, but difficult to do after the fact with such a large table. We've also thought about trying to find a way to merge both the events and event_json tables, but haven't done any actual work to that effect.

It might help to purge the offending rooms from the server, if the side effects thereof are acceptable.

@JanZerebecki
Copy link
Author

Loosing more data of the relevant rooms is probably worse. Is there working tooling to backup and restore room history from outside synapse? Is there existing code to check or repair more consistency than the DB constraints provide?

@JanZerebecki
Copy link
Author

Note: It was necessary to edit the DB because postgresql violated its own UNIQUE constraints because it doesn't notice if glibc collation changes require a reindex.

@richvdh
Copy link
Member

richvdh commented Sep 10, 2021

There is no additional tooling that I am aware of.

It was necessary to edit the DB because postgresql violated its own UNIQUE constraints because it doesn't notice if glibc collation changes require a reindex.

this is why synapse emits a warning if you use incorrect collation or ctype settings (#6734)

@JanZerebecki
Copy link
Author

Thx, the warning is indeed in the logs: synapse.storage.engines.postgres: [main] Database has incorrect ctype of 'en_US.UTF-8'. Should be 'C'

Perhaps instead of a warning that should be made to abort startup unless a new setting corrup_my_db_silently_on_libc_update: true is set.

Perhaps a better way instead of my above PR is to create a tool shipped with synapse that can be used by an admin to run through the db to check for and and optionally fix such inconsistencies, either by synthesizing the missing rows from other tables or by fetching from federation or by deleting those rows that have their corresponding row missing. Though I probably won't get to that.

After running synapse with the above PR as of d891766 I get another constraint violation which I'll try fix in a similar way.

Sep 15 15:16:14 matrix synapse[1992]: synapse.handlers.federation_event: [_process_incoming_pdus_in_room_inner-REDACTED1] Error attempting to resolve state at missing prev_events
                                      Traceback (most recent call last):
                                        File "/usr/lib/python3.6/site-packages/synapse/handlers/federation_event.py", line 712, in _resolve_state_at_missing_prevs
                                          dest, room_id, p
                                        File "/usr/lib/python3.6/site-packages/synapse/handlers/federation_event.py", line 819, in _get_state_after_missing_prev_event
                                          destination=destination, room_id=room_id, events=missing_events
                                        File "/usr/lib/python3.6/site-packages/synapse/handlers/federation_event.py", line 1155, in _get_events_and_persist
                                          event_infos,
                                        File "/usr/lib/python3.6/site-packages/synapse/handlers/federation_event.py", line 1198, in _auth_and_persist_events
                                          for ev_info, context in zip(event_infos, contexts)
                                        File "/usr/lib/python3.6/site-packages/synapse/handlers/federation_event.py", line 1743, in persist_events_and_notify
                                          event_and_contexts, backfilled=backfilled
                                        File "/usr/lib/python3.6/site-packages/synapse/storage/persist_events.py", line 322, in persist_events
                                          ret_vals = await yieldable_gather_results(enqueue, partitioned.items())
                                        File "/usr/lib/python3.6/site-packages/synapse/storage/persist_events.py", line 240, in handle_queue_loop
                                          item.events_and_contexts, item.backfilled
                                        File "/usr/lib/python3.6/site-packages/synapse/storage/persist_events.py", line 582, in _persist_event_batch
                                          backfilled=backfilled,
                                        File "/usr/lib/python3.6/site-packages/synapse/storage/databases/main/events.py", line 180, in _persist_events_and_state_updates
                                          new_forward_extremeties=new_forward_extremeties,
                                        File "/usr/lib/python3.6/site-packages/synapse/storage/database.py", line 694, in runInteraction
                                          **kwargs,
                                        File "/usr/lib/python3.6/site-packages/synapse/storage/database.py", line 792, in runWithConnection
                                          self._db_pool.runWithConnection(inner_func, *args, **kwargs)
                                        File "/usr/lib/python3.6/site-packages/twisted/python/threadpool.py", line 238, in inContext
                                          result = inContext.theWork()  # type: ignore[attr-defined]
                                        File "/usr/lib/python3.6/site-packages/twisted/python/threadpool.py", line 255, in <lambda>
                                          ctx, func, *args, **kw
                                        File "/usr/lib/python3.6/site-packages/twisted/python/context.py", line 118, in callWithContext
                                          return self.currentContext().callWithContext(ctx, func, *args, **kw)
                                        File "/usr/lib/python3.6/site-packages/twisted/python/context.py", line 83, in callWithContext
                                          return func(*args, **kw)
                                        File "/usr/lib/python3.6/site-packages/twisted/enterprise/adbapi.py", line 293, in _runWithConnection
                                          compat.reraise(excValue, excTraceback)
                                        File "/usr/lib/python3.6/site-packages/twisted/python/deprecate.py", line 298, in deprecatedFunction
                                          return function(*args, **kwargs)
                                        File "/usr/lib/python3.6/site-packages/twisted/python/compat.py", line 404, in reraise
                                          raise exception.with_traceback(traceback)
                                        File "/usr/lib/python3.6/site-packages/twisted/enterprise/adbapi.py", line 284, in _runWithConnection
                                          result = func(conn, *args, **kw)
                                        File "/usr/lib/python3.6/site-packages/synapse/storage/database.py", line 786, in inner_func
                                          return func(db_conn, *args, **kwargs)
                                        File "/usr/lib/python3.6/site-packages/synapse/storage/database.py", line 554, in new_transaction
                                          r = func(cursor, *args, **kwargs)
                                        File "/usr/lib/python3.6/site-packages/synapse/logging/utils.py", line 69, in wrapped
                                          return f(*args, **kwargs)
                                        File "/usr/lib/python3.6/site-packages/synapse/storage/databases/main/events.py", line 378, in _persist_events_txn
                                          self._store_event_txn(txn, events_and_contexts=events_and_contexts)
                                        File "/usr/lib/python3.6/site-packages/synapse/storage/databases/main/events.py", line 1415, in _store_event_txn
                                          txn, table="state_events", values=state_values
                                        File "/usr/lib/python3.6/site-packages/synapse/storage/database.py", line 929, in simple_insert_many_txn
                                          txn.execute_batch(sql, vals)
                                        File "/usr/lib/python3.6/site-packages/synapse/storage/database.py", line 279, in execute_batch
                                          self._do_execute(lambda *x: execute_batch(self.txn, *x), sql, args)
                                        File "/usr/lib/python3.6/site-packages/synapse/storage/database.py", line 331, in _do_execute
                                          return func(sql, *args)
                                        File "/usr/lib/python3.6/site-packages/synapse/storage/database.py", line 279, in <lambda>
                                          self._do_execute(lambda *x: execute_batch(self.txn, *x), sql, args)
                                        File "/usr/lib64/python3.6/site-packages/psycopg2/extras.py", line 1209, in execute_batch
                                          cur.execute(b";".join(sqls))
                                      psycopg2.errors.UniqueViolation: duplicate key value violates unique constraint "state_events_event_id_key"
                                      DETAIL:  Key (event_id)=(REDACTED2) already exists.

@reivilibre
Copy link
Contributor

reivilibre commented Sep 16, 2021

For what it's worth:

Synapse will refuse to set up a new database if it has the wrong values of COLLATE and CTYPE set, and will log warnings on existing databases.

https://matrix-org.github.io/synapse/latest/postgres.html#fixing-incorrect-collate-or-ctype

I don't know how long that has been the case*; wonder if you created your database before then or perhaps managed to flip it over during a backup/restore?

*edit: #6734 — early 2020.

@callahad
Copy link
Contributor

Perhaps instead of a warning that should be made to abort startup unless a new setting corrup_my_db_silently_on_libc_update: true is set.

We're fine going with this approach to at least help people avoid silent corruption... details tbd.

@callahad
Copy link
Contributor

We should also probably link to information on how Glibc collation changes can break PostgreSQL. E.g., https://wiki.postgresql.org/wiki/Locale_data_changes

@callahad callahad added P3 (OBSOLETE: use S- labels.) Approved backlog: not yet scheduled, will accept patches T-Task Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks. and removed X-Needs-Discussion labels Sep 23, 2021
@ananace
Copy link

ananace commented Nov 3, 2021

I ran into some database corruption due to a hardware failure, which caused some data from the event tables to go out of sync with eachother.
Unfortunately Synapse wasn't able to recover since it'd crash out on storing any of the missing data, and since it was a hardware fault there was data from all over the place that was corrupted, so I couldn't easily drop unneeded events to let them redownload that way.

I ended up taking the linked PR, rebasing it to 1.46 and expanding it so all event storage was done through upserts, which was able to take my server back to a live state again.

@hellcp
Copy link

hellcp commented Nov 13, 2021

I don't know how long that has been the case*; wonder if you created your database before then or perhaps managed to flip it over during a backup/restore?

I distinctly remember setting up the database with correct collate and ctype values, so I assume it must have happened during the large db outage we had in the summer

@richvdh
Copy link
Member

richvdh commented Feb 21, 2022

We don't seem to have written down a clear conclusion about what to do with this, but I'm not happy switching all our inserts to upserts in the hope that it somehow works around a corrupt index, so I'm going to go ahead and close it.

@richvdh richvdh closed this as completed Feb 21, 2022
@JanZerebecki
Copy link
Author

Please reopen. We did get to a conclusion but the change is not implemented, yet:

Perhaps instead of a warning that should be made to abort startup unless a new setting corrup_my_db_silently_on_libc_update: true is set.

We're fine going with this approach to at least help people avoid silent corruption... details tbd.

We should probably also add the commands needed to drop the broken data from a DB to https://matrix-org.github.io/synapse/latest/postgres.html#fixing-incorrect-collate-or-ctype and also add:

We should also probably link to information on how Glibc collation changes can break PostgreSQL. E.g., https://wiki.postgresql.org/wiki/Locale_data_changes

@richvdh
Copy link
Member

richvdh commented Feb 21, 2022

I suggest opening a new issue for those changes... they are quite a long way from the original issue.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
P3 (OBSOLETE: use S- labels.) Approved backlog: not yet scheduled, will accept patches T-Task Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants