Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Complement TestDeviceListUpdates/when_remote_user_joins_a_room flake(?) #14103

Closed
DMRobertson opened this issue Oct 7, 2022 · 21 comments · Fixed by matrix-org/complement#593 or matrix-org/complement#610
Assignees
Labels
A-Device-List-Tracking Telling clients about other devices. Often related to E2EE. A-Testing Issues related to testing in complement, synapse, etc T-Task Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks. Z-Dev-Wishlist Makes developers' lives better, but doesn't have direct user impact Z-Flake Tests that give intermittent failures

Comments

@DMRobertson
Copy link
Contributor

https://github.com/matrix-org/synapse/actions/runs/3204302461/jobs/5235520023

      client.go:604: [CSAPI] POST hs1/_matrix/client/v3/register => 200 OK (195.428656ms)
      client.go:604: [CSAPI] POST hs2/_matrix/client/v3/register => 200 OK (314.031457ms)
      client.go:604: [CSAPI] POST hs2/_matrix/client/v3/keys/upload => 200 OK (110.185852ms)
      client.go:604: [CSAPI] POST hs1/_matrix/client/v3/createRoom => 200 OK (972.145549ms)
      client.go:604: [CSAPI] GET hs1/_matrix/client/v3/sync => 200 OK (391.917818ms)
      client.go:604: [CSAPI] POST hs2/_matrix/client/v3/join/!arpRCWNVBTAxZTROZD:hs1 => 200 OK (1.511062331s)
      client.go:604: [CSAPI] GET hs2/_matrix/client/v3/sync => 200 OK (607.838192ms)
      client.go:604: [CSAPI] GET hs1/_matrix/client/v3/sync => 200 OK (68.691406ms)
      client.go:604: [CSAPI] POST hs1/_matrix/client/v3/keys/query => 200 OK (421.023188ms)
      client.go:604: [CSAPI] GET hs1/_matrix/client/v3/sync => 200 OK (72.393428ms)
      client.go:604: [CSAPI] POST hs2/_matrix/client/v3/keys/upload => 200 OK (95.074462ms)
      client.go:604: [CSAPI] GET hs1/_matrix/client/v3/sync => 200 OK (25.043548ms)
      client.go:604: [CSAPI] POST hs1/_matrix/client/v3/keys/query => 200 OK (57.322639ms)
      device_lists_test.go:139: MatchResponse key 'device_keys.@bob4:hs2.ZJUCSAPZAX.keys.ed25519:ZJUCSAPZAX' got 'PbiLbZNrfrWbmKA7kCBEhJJ0GCXy+1TElIftsHrUq/k' want 'vxjhBYRfe3Q5xzOtu4i1BBX2fNtOcCPfmDp5Fhz7Nh0' - http://localhost:49456/_matrix/client/v3/keys/query => {"device_keys":{"@bob4:hs2":{"ZJUCSAPZAX":{"algorithms":["m.olm.v1.curve25519-aes-sha2","m.megolm.v1.aes-sha2"],"device_id":"ZJUCSAPZAX","keys":{"curve25519:ZJUCSAPZAX":"L++DP/WvCewd4aygNB0hNiMxyz0SH+ervHL0zadrGg4","ed25519:ZJUCSAPZAX":"PbiLbZNrfrWbmKA7kCBEhJJ0GCXy+1TElIftsHrUq/k"},"user_id":"@bob4:hs2","unsigned":{}}}},"failures":{},"master_keys":{},"self_signing_keys":{},"user_signing_keys":{}}

Very possibly covered by another issue. Sorry for noise if so.

@DMRobertson DMRobertson added T-Task Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks. A-Testing Issues related to testing in complement, synapse, etc Z-Flake Tests that give intermittent failures labels Oct 7, 2022
@squahtx squahtx added the Z-Dev-Wishlist Makes developers' lives better, but doesn't have direct user impact label Oct 7, 2022
@squahtx squahtx changed the title Complement when_remote_user_joins_a_room flake(?) Complement TestDeviceListUpdates/when_remote_user_joins_a_room flake(?) Oct 7, 2022
@squahtx squahtx added the A-Device-List-Tracking Telling clients about other devices. Often related to E2EE. label Oct 7, 2022
@squahtx
Copy link
Contributor

squahtx commented Oct 7, 2022

todo: figure out whether this has the same cause as matrix-org/complement#580

@DMRobertson
Copy link
Contributor Author

@DMRobertson
Copy link
Contributor Author

@DMRobertson
Copy link
Contributor Author

@DMRobertson
Copy link
Contributor Author

@squahtx
Copy link
Contributor

squahtx commented Nov 8, 2022

@squahtx
Copy link
Contributor

squahtx commented Nov 11, 2022

@DMRobertson
Copy link
Contributor Author

@DMRobertson
Copy link
Contributor Author

@DMRobertson
Copy link
Contributor Author

@DMRobertson
Copy link
Contributor Author

DMRobertson commented Nov 30, 2022

https://github.com/matrix-org/synapse/actions/runs/3586096523/jobs/6034942439

Edit: @realtyem notes:

Head's up: The Complement run you added to the issue 14103, the error was different this time(if it matters). Returned a 502 "Bad gateway" style error this time. Like this:

client.go:602: [CSAPI] POST hs2/_matrix/client/v3/join/!oVmMWustnnCJYZfUFM:hs1 => 502 Bad Gateway (46.00983ms)

device_lists_test.go:116: CSAPI.MustDoFunc POST http://127.0.0.1:49414/_matrix/client/v3/join/%21oVmMWustnnCJYZfUFM:hs1?server_name=hs1 returned non-2xx code: 502 Bad Gateway - body:

{"errcode":"M_UNKNOWN","error":"Failed to make_join via any server"}

So there may be multiple failure modes at play here(?)

@DMRobertson
Copy link
Contributor Author

@DMRobertson
Copy link
Contributor Author

@DMRobertson
Copy link
Contributor Author

@realtyem
Copy link
Contributor

realtyem commented Dec 2, 2022

@DMRobertson
Copy link
Contributor Author

@DMRobertson
Copy link
Contributor Author

@squahtx
Copy link
Contributor

squahtx commented Feb 14, 2023

Reopening this issue, but this time the flake has a different cause to the original occurrence.

@squahtx
Copy link
Contributor

squahtx commented Feb 14, 2023

Taking https://github.com/matrix-org/synapse/actions/runs/4145448298/jobs/7169889241 as an example of the flake:

We have two users, Alice (@alice10:hs1) and Bob (@bob11:hs2). In this test, Alice remote-joins Bob's room, expects to see a device_lists.changed entry in /sync upon join, and then expects to see another device_lists.changed entry in /sync when Bob sends a device list update.

The test room is !irYOTaYDzxImjNsOIV:hs2
And there is a barrier room from the fix for the original cause, !HFsvlVDacxfKQhqVrP:hs2.


During the test run:

Bob (hs2) creates the test room !irYOTaYDzxImjNsOIV:hs2.

Alice (hs1) joins the remote room.
hs2 sends back the membership for Bob as a hero.
hs1 thinks both Alice and Bob are in the room.

event_creator1   | 2023-02-10 16:08:17,339 Processed POST /_matrix/client/v3/join/%21irYOTaYDzxImjNsOIV:hs2?server_name=hs2

Alice (hs1) expects a device_lists.changed entry for Bob (hs2), because Alice has just joined a room shared with Bob.
At this point the room is still partial stated and does not appear in the sync response.
Unexpected behaviour 1: Despite the room still being partial stated, Bob appears in device_lists.changed and the test advances.
Arguably this is weird, but not forbidden by the spec.

synchrotron1     | 2023-02-10 16:08:17,562 Failed to find any events in room !irYOTaYDzxImjNsOIV:hs2 at RoomStreamToken(topological=None, stream=42, instance_map=frozendict.frozendict({}))
synchrotron1     | 2023-02-10 16:08:17,565 Processed GET /_matrix/client/v3/sync?since=s42_9_0_1_1_1_1_19_0_3&timeout=1000
master           | 2023-02-10 16:08:17,685 Processed POST /_matrix/client/v3/keys/query

Unexpected behaviour 2: Synapse is looking for events in the partial state room for some reason.

Synapse emits another changed update after querying keys above.
The test tries to skip over it by sending another device list update and waiting for it.

master           | 2023-02-10 16:08:17,744 sync_partial_state_room-2 - Updating current state for !irYOTaYDzxImjNsOIV:hs2
event_persister1 | 2023-02-10 16:08:17,807 Processed POST /_synapse/replication/update_current_state/%21irYOTaYDzxImjNsOIV%3Ahs2/ggnOYRzLBl
master           | 2023-02-10 16:08:17,836 sync_partial_state_room-2 - Handling any pending device list updates
master           | 2023-02-10 16:08:17,890 Received device list update for @barry12:hs2, requiring resync: True. Devices: LOMNRREYOD
master           | 2023-02-10 16:08:17,874 sync_partial_state_room-2 - Clearing partial-state flag for !irYOTaYDzxImjNsOIV:hs2
synchrotron1     | 2023-02-10 16:08:17,966 Processed GET /_matrix/client/v3/sync?since=s43_10_0_1_1_1_1_19_0_3&timeout=1000

Bob then updates their device list.
Alice expects to see a device list update from Bob, as a device_lists.changed entry in /sync.

master           | 2023-02-10 16:08:17,996 sync_partial_state_room-2 - State resync complete for !irYOTaYDzxImjNsOIV:hs2
synchrotron1     | 2023-02-10 16:08:18,183 Processed GET /_matrix/client/v3/sync?since=s44_10_0_1_1_1_1_22_0_3&timeout=1000
master           | 2023-02-10 16:08:18,229 Received device list update for @bob11:hs2, requiring resync: True. Devices: PZYMGRAHFY
master           | 2023-02-10 16:08:18,268 Processed POST /_matrix/client/v3/keys/query
master           | 2023-02-10 16:08:18,339 Processed POST /_synapse/replication/fed_send_edu/m.device_list_update/ABXwrPvBSd

Unexpected behaviour 3: Alice does see a device_lists.changed entry for Bob, except it's due to the room being un-partial stated instead of due to Bob's device list update, which hs1 receives afterwards.

@squahtx
Copy link
Contributor

squahtx commented Feb 14, 2023

Unexpected behaviour 1: Despite the room still being partial stated, Bob appears in device_lists.changed and the test advances.

Unexpected behaviour 2: Synapse is looking for events in the partial state room for some reason.

Both of these are fixed by #15069.

Unexpected behaviour 3: Alice does see a device_lists.changed entry for Bob, except it's due to the room being un-partial stated instead of due to Bob's device list update, which hs1 receives afterwards.

Fixed by matrix-org/complement#610, which ensures that rooms are un-partial stated very early in the tests.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A-Device-List-Tracking Telling clients about other devices. Often related to E2EE. A-Testing Issues related to testing in complement, synapse, etc T-Task Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks. Z-Dev-Wishlist Makes developers' lives better, but doesn't have direct user impact Z-Flake Tests that give intermittent failures
Projects
None yet
4 participants