This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

events get soft-failed when the federation_inbound worker is busy #7744

Open
richvdh opened this issue Jun 25, 2020 · 7 comments
Labels
A-Federation A-Soft-Failure O-Uncommon Most users are unlikely to come across this or unexpected workflow S-Major Major functionality / product severely impaired, no satisfactory workaround. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.

Comments

@richvdh
Member

richvdh commented Jun 25, 2020

event ids $44J0EFE_wD30pBA2EZ2hu_viEDWZ55CFZCJEliLoQOY, $d6_JvuU1AdlqnwBEK2shSB30_qFWIiZPlg526HJJuQQ, $FnodZIXcp1YSMx1RTvfp1RZIp51VgU32JCQaVmEv8Z8 were all mysteriously soft-failed on matrix.org, presumably because synapse thought that the sender wasn't in the room - but she had joined a clear 3 minutes earlier.

@richvdh
Member Author

richvdh commented Jun 25, 2020

inspection of the state_groups around that event shows clearly that the sender of the event was in the room. It's not even a particularly complex DAG, so it kinda has to be a transient problem with "current_state". Possibly there was a delay in invalidating the current_state cache on the worker processing the inbound events?

next steps might be to dig up the logs for when those events arrived over federation to see if there is anything funky?

@erikjohnston
Member

FWIW I think we keep track of the forward extremities at a given stream position in the stream_ordering_to_exterm table
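For anyone wanting to poke at that idea, here is a rough sketch of the kind of lookup such a table would allow. The column layout (stream_ordering, room_id, event_id), the rows, and the helper name `extremities_at` are all assumptions made for illustration; the real Synapse schema and queries may differ.

```python
import sqlite3

# Assumed table shape, for illustration only; check the real schema before relying on it.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stream_ordering_to_exterm ("
    " stream_ordering INTEGER NOT NULL,"
    " room_id TEXT NOT NULL,"
    " event_id TEXT NOT NULL)"
)

# Made-up rows: the room's forward extremities as recorded at three stream positions.
conn.executemany(
    "INSERT INTO stream_ordering_to_exterm VALUES (?, ?, ?)",
    [
        (100, "!room", "$a"),
        (105, "!room", "$b"),
        (105, "!room", "$c"),
        (110, "!room", "$d"),
    ],
)


def extremities_at(conn, room_id, stream_ordering):
    """Return the extremities recorded at the latest position <= stream_ordering."""
    rows = conn.execute(
        "SELECT event_id FROM stream_ordering_to_exterm"
        " WHERE room_id = ? AND stream_ordering = ("
        "   SELECT MAX(stream_ordering) FROM stream_ordering_to_exterm"
        "   WHERE room_id = ? AND stream_ordering <= ?)"
        " ORDER BY event_id",
        (room_id, room_id, stream_ordering),
    ).fetchall()
    return [event_id for (event_id,) in rows]


print(extremities_at(conn, "!room", 107))
```

Comparing the extremities at the stream position of the join with those at the position where the message was processed would show whether the worker could have seen the membership event yet.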

@richvdh
Member Author

richvdh commented Jun 30, 2020

so we have, on the master:

2020-06-24 18:27:40,673 - synapse.replication.http.membership - 230 - INFO - POST-7968707 - user membership change: @<user> in !<room>

and then, on federation_inbound:

2020-06-24 18:30:10,443 - synapse.handlers.federation - 2125 - WARNING - PUT-6586266-$44J0EFE_wD30pBA2EZ2hu_viEDWZ55CFZCJEliLoQOY - Soft-failing <FrozenEventV3 event_id='$44J0EFE_wD30pBA2EZ2hu_viEDWZ55CFZCJEliLoQOY', type='m.room.message', state_key='None'> because 403: User @<user> not in room !<room> (<FrozenEventV3 event_id='$_nUedokc9-SqfRglu-uUwmCTN6lhXgT2lFhjUXwUFug', type='m.room.member', state_key='@<user>'>)

so indeed it looks like a cache isn't being correctly flushed.
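As a quick sanity check on the timing, the gap between the two log lines can be worked out directly. This assumes the clocks on the master and the federation_inbound worker agree and that both log in the same timezone; the variable names are just for illustration.

```python
from datetime import datetime

# Timestamps copied from the two log lines above.
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"
joined_at = datetime.strptime("2020-06-24 18:27:40,673", LOG_TIME_FORMAT)
soft_failed_at = datetime.strptime("2020-06-24 18:30:10,443", LOG_TIME_FORMAT)

# Time between the membership change on the master and the soft-fail on the worker.
gap = soft_failed_at - joined_at
gap_seconds = gap.total_seconds()
print(f"{gap_seconds:.2f}s between the membership change and the soft-fail")
```

That comes to roughly two and a half minutes, far longer than an ordinary cache invalidation should take to propagate.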

@richvdh
Member Author

richvdh commented Jul 1, 2020

so, the federation inbound worker was 250K events behind at the time:

image

This is an ongoing problem.
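To make the suspected failure mode concrete, here is a toy model of a worker whose view of room membership only advances as it replays a stream. It is not Synapse code: `ToyWorker`, the stream layout, and the user ids are hypothetical, and the thread does not confirm that this is the actual mechanism.

```python
from dataclasses import dataclass, field


@dataclass
class ToyWorker:
    """A worker whose membership view only advances as it replays the stream."""

    position: int = 0
    members: set = field(default_factory=set)

    def catch_up(self, stream, up_to):
        # Apply membership changes after our current position, up to and including `up_to`.
        for pos, user in stream:
            if self.position < pos <= up_to:
                self.members.add(user)
                self.position = pos

    def check_event(self, sender):
        # Same shape as the check in the log line above, not Synapse's actual logic.
        return "accepted" if sender in self.members else "soft-failed"


# Made-up stream of (position, joining user); the second join sits 250K positions ahead.
stream = [(1, "@alice:example.org"), (250_001, "@bob:example.org")]

worker = ToyWorker()
worker.catch_up(stream, up_to=1)
before = worker.check_event("@bob:example.org")  # worker has not reached the join yet

worker.catch_up(stream, up_to=250_001)
after = worker.check_event("@bob:example.org")  # worker has now replayed the join

print(before, after)
```

In this model the same message is rejected or accepted depending only on how far the worker has caught up, which matches the symptom of a sender who joined minutes earlier being treated as not in the room.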

@richvdh
Member Author

richvdh commented Jul 1, 2020

see also: #7444

@richvdh
Member Author

richvdh commented Jul 1, 2020

also see also: #7669

@richvdh richvdh changed the title from "events mysteriously soft-failed" to "events get soft-failed when the federation_inbound worker is busy" Jul 1, 2020
@richvdh richvdh added z-bug (Deprecated Label) p1 labels Jul 1, 2020
@richvdh
Member Author

richvdh commented Apr 20, 2021

also also see also: #6536 -> #10066

@MadLittleMods MadLittleMods added T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. A-Federation labels Jul 8, 2021
@DMRobertson DMRobertson added S-Major Major functionality / product severely impaired, no satisfactory workaround. O-Uncommon Most users are unlikely to come across this or unexpected workflow A-Soft-Failure and removed z-p1 z-bug (Deprecated Label) labels Sep 6, 2022
4 participants