Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Complement TestPartialStateJoin/Room_aliases_can_be_added_and_queried_during_a_resync is flakey #14543

Closed
DMRobertson opened this issue Nov 24, 2022 · 4 comments · Fixed by matrix-org/complement#570
Labels
A-Federated-Join joins over federation generally suck A-Testing Issues related to testing in complement, synapse, etc T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. Z-Dev-Wishlist Makes developers' lives better, but doesn't have direct user impact Z-Flake Tests that give intermittent failures

Comments

@DMRobertson
Copy link
Contributor

TestPartialStateJoin/Room_aliases_can_be_added_and_queried_during_a_resync

      client.go:599: [CSAPI] POST hs1/_matrix/client/v3/register => 200 OK (32.809293ms)
      client.go:599: [CSAPI] GET hs1/_matrix/client/v3/capabilities => 200 OK (8.09415ms)
      server.go:170: Creating room !0-UceOsu61aV8dXGXSxM:host.docker.internal:33837 with version 9
      federation_room_join_partial_state_test.go:3468: Registered state_ids handler for event $g53VZ818ZJcgniyOAKqjgOgqN7Z7aEPYJ_Cq9a5NbVE
      federation_room_join_partial_state_test.go:3509: Registered /state handler for event $g53VZ818ZJcgniyOAKqjgOgqN7Z7aEPYJ_Cq9a5NbVE
      client.go:599: [CSAPI] POST hs1/_matrix/client/v3/join/!0-UceOsu61aV8dXGXSxM:host.docker.internal:33837 => 502 Bad Gateway (6.328761ms)
      federation_room_join_partial_state_test.go:3380: CSAPI.MustDoFunc POST http://localhost:49420/_matrix/client/v3/join/%210-UceOsu61aV8dXGXSxM:host.docker.internal:33837?server_name=host.docker.internal%3A33837 returned non-2xx code: 502 Bad Gateway - body: {"errcode":"M_UNKNOWN","error":"Failed to make_join via any server"}

Why did we get a 502?

2022-11-24T07:20:08.4197065Z nginx | 192.168.16.1 - - [24/Nov/2022:07:14:39 +0000] "POST /_matrix/client/v3/join/%210-UceOsu61aV8dXGXSxM:host.docker.internal:33837?server_name=host.docker.internal%3A33837 HTTP/1.1" 502 95 "-" "Go-http-client/1.1"

2022-11-24T07:20:08.4175716Z synapse_main | 2022-11-24 07:14:39,151 - synapse.http.servlet - 668 - WARNING - POST-516 - Unable to parse JSON from POST /_matrix/client/v3/join/%210-UceOsu61aV8dXGXSxM:host.docker.internal:33837?server_name=host.docker.internal%3A33837 response: Expecting value: line 1 column 1 (char 0) (b'')
2022-11-24T07:20:08.4176297Z synapse_main | 2022-11-24 07:14:39,155 - synapse.federation.federation_client - 850 - INFO - POST-516 - make_join: Not retrying server host.docker.internal:33837 because we tried it recently retry_last_ts=1669273939360 and we won't check for another retry_interval=600000ms.
2022-11-24T07:20:08.4177016Z synapse_main | 2022-11-24 07:14:39,156 - synapse.http.server - 108 - INFO - POST-516 - <SynapseRequest at 0x7f785e4a0c40 method='POST' uri='/_matrix/client/v3/join/%210-UceOsu61aV8dXGXSxM:host.docker.internal:33837?server_name=host.docker.internal%3A33837' clientproto='HTTP/1.0' site='8080'> SynapseError: 502 - Failed to make_join via any server
2022-11-24T07:20:08.4177822Z synapse_main | 2022-11-24 07:14:39,156 - synapse.access.http.8080 - 460 - INFO - POST-516 - ::ffff:127.0.0.1 - 8080 - {@t40alice:hs1} Processed request: 0.005sec/0.000sec (0.003sec, 0.000sec) (0.001sec/0.001sec/8) 84B 502 "POST /_matrix/client/v3/join/%210-UceOsu61aV8dXGXSxM:host.docker.internal:33837?server_name=host.docker.internal%3A33837 HTTP/1.0" "Go-http-client/1.1" [0 dbevts]

In particular:

Unable to parse JSON from POST /_matrix/client/v3/join/%210-UceOsu61aV8dXGXSxM:host.docker.internal:33837?server_name=host.docker.internal%3A33837 response: Expecting value: line 1 column 1 (char 0) (b'')

erm wat. Is this a complement test bug that I wrote?

@DMRobertson DMRobertson added A-Federated-Join joins over federation generally suck T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. Z-Flake Tests that give intermittent failures labels Nov 24, 2022
@squahtx squahtx added Z-Dev-Wishlist Makes developers' lives better, but doesn't have direct user impact A-Testing Issues related to testing in complement, synapse, etc labels Nov 28, 2022
@squahtx
Copy link
Contributor

squahtx commented Dec 8, 2022

Unable to parse JSON from POST /_matrix/client/v3/join/%210-UceOsu61aV8dXGXSxM:host.docker.internal:33837?server_name=host.docker.internal%3A33837 response: Expecting value: line 1 column 1 (char 0) (b'')

erm wat. Is this a complement test bug that I wrote?

That'll be fixed by matrix-org/complement#562

@squahtx
Copy link
Contributor

squahtx commented Dec 8, 2022

POST-516 - make_join: Not retrying server host.docker.internal:33837 because we tried it recently retry_last_ts=1669273939360 and we won't check for another retry_interval=600000ms.

That implies that there's leftover state from a previous test, where Synapse wanted to contact the complement homeserver, but we'd already torn it down.

2022-11-24T07:20:08.1949821Z synapse_main | 2022-11-24 07:12:26,742 - synapse.handlers.federation_event - 1069 - WARNING - sync_partial_state_room-3-$e_Iv3o9MlGOZ4Jqsd7-6GK1X9e3sq4RM0YFo2V-lBgw - Error attempting to resolve state at missing prev_events: Not retrying server host.docker.internal:33837 because we tried it recently retry_last_ts=1669273939360 and we won't check for another retry_interval=600000ms.
2022-11-24T07:20:08.1950673Z synapse_main | 2022-11-24 07:12:26,743 - synapse.handlers.federation - 1741 - ERROR - sync_partial_state_room-3 - Failed to get state for !0-MOmOwHdSisAviwUo7c:host.docker.internal:33837 at <FrozenEventV3 event_id=$e_Iv3o9MlGOZ4Jqsd7-6GK1X9e3sq4RM0YFo2V-lBgw, type=m.room.member, state_key=@t10alice:hs1, outlier=0> from host.docker.internal:33837 because ERROR 403: We can't get valid state history., giving up!
2022-11-24T07:20:08.1951109Z synapse_main | 2022-11-24 07:12:26,743 - synapse.metrics.background_process_metrics - 244 - ERROR - sync_partial_state_room-3 - Background process 'sync_partial_state_room' threw an exception
2022-11-24T07:20:08.1951230Z synapse_main | Traceback (most recent call last):
2022-11-24T07:20:08.1951611Z synapse_main |   File "/usr/local/lib/python3.9/site-packages/synapse/handlers/federation_event.py", line 1052, in _compute_event_context_with_maybe_missing_prevs
2022-11-24T07:20:08.1951810Z synapse_main |     await self._get_state_ids_after_missing_prev_event(
2022-11-24T07:20:08.1952120Z synapse_main |   File "/usr/local/lib/python3.9/site-packages/synapse/logging/opentracing.py", line 889, in _wrapper
2022-11-24T07:20:08.1952268Z synapse_main |     return await func(*args, **kwargs)  # type: ignore[misc]
2022-11-24T07:20:08.1952574Z synapse_main |   File "/usr/local/lib/python3.9/site-packages/synapse/logging/opentracing.py", line 889, in _wrapper
2022-11-24T07:20:08.1952718Z synapse_main |     return await func(*args, **kwargs)  # type: ignore[misc]
2022-11-24T07:20:08.1953169Z synapse_main |   File "/usr/local/lib/python3.9/site-packages/synapse/handlers/federation_event.py", line 1115, in _get_state_ids_after_missing_prev_event
2022-11-24T07:20:08.1953315Z synapse_main |     ) = await self._federation_client.get_room_state_ids(
2022-11-24T07:20:08.1953619Z synapse_main |   File "/usr/local/lib/python3.9/site-packages/synapse/logging/opentracing.py", line 889, in _wrapper
2022-11-24T07:20:08.1953763Z synapse_main |     return await func(*args, **kwargs)  # type: ignore[misc]
2022-11-24T07:20:08.1954111Z synapse_main |   File "/usr/local/lib/python3.9/site-packages/synapse/logging/opentracing.py", line 889, in _wrapper
2022-11-24T07:20:08.1954257Z synapse_main |     return await func(*args, **kwargs)  # type: ignore[misc]
2022-11-24T07:20:08.1954604Z synapse_main |   File "/usr/local/lib/python3.9/site-packages/synapse/federation/federation_client.py", line 503, in get_room_state_ids
2022-11-24T07:20:08.1954758Z synapse_main |     result = await self.transport_layer.get_room_state_ids(
2022-11-24T07:20:08.1955096Z synapse_main |   File "/usr/local/lib/python3.9/site-packages/synapse/federation/transport/client.py", line 78, in get_room_state_ids
2022-11-24T07:20:08.1955217Z synapse_main |     return await self.client.get_json(
2022-11-24T07:20:08.1955545Z synapse_main |   File "/usr/local/lib/python3.9/site-packages/synapse/http/matrixfederationclient.py", line 1077, in get_json
2022-11-24T07:20:08.1955715Z synapse_main |     response = await self._send_request_with_optional_trailing_slash(
2022-11-24T07:20:08.1956105Z synapse_main |   File "/usr/local/lib/python3.9/site-packages/synapse/http/matrixfederationclient.py", line 406, in _send_request_with_optional_trailing_slash
2022-11-24T07:20:08.1956269Z synapse_main |     response = await self._send_request(request, **send_request_args)
2022-11-24T07:20:08.1956603Z synapse_main |   File "/usr/local/lib/python3.9/site-packages/synapse/http/matrixfederationclient.py", line 503, in _send_request
2022-11-24T07:20:08.1956774Z synapse_main |     limiter = await synapse.util.retryutils.get_retry_limiter(
2022-11-24T07:20:08.1957086Z synapse_main |   File "/usr/local/lib/python3.9/site-packages/synapse/util/retryutils.py", line 106, in get_retry_limiter
2022-11-24T07:20:08.1957218Z synapse_main |     raise NotRetryingDestination(
2022-11-24T07:20:08.1957738Z synapse_main | synapse.util.retryutils.NotRetryingDestination: Not retrying server host.docker.internal:33837 because we tried it recently retry_last_ts=1669273939360 and we won't check for another retry_interval=600000ms.
2022-11-24T07:20:08.1957809Z synapse_main | 
2022-11-24T07:20:08.1957985Z synapse_main | During handling of the above exception, another exception occurred:
2022-11-24T07:20:08.1958060Z synapse_main | 
2022-11-24T07:20:08.1958182Z synapse_main | Traceback (most recent call last):
2022-11-24T07:20:08.1958511Z synapse_main |   File "/usr/local/lib/python3.9/site-packages/synapse/metrics/background_process_metrics.py", line 242, in run
2022-11-24T07:20:08.1958633Z synapse_main |     return await func(*args, **kwargs)
2022-11-24T07:20:08.1958970Z synapse_main |   File "/usr/local/lib/python3.9/site-packages/synapse/handlers/federation.py", line 1713, in _sync_partial_state_room
2022-11-24T07:20:08.1959151Z synapse_main |     await self._federation_event_handler.update_state_for_partial_state_event(
2022-11-24T07:20:08.1959563Z synapse_main |   File "/usr/local/lib/python3.9/site-packages/synapse/handlers/federation_event.py", line 584, in update_state_for_partial_state_event
2022-11-24T07:20:08.1959739Z synapse_main |     context = await self._compute_event_context_with_maybe_missing_prevs(
2022-11-24T07:20:08.1960043Z synapse_main |   File "/usr/local/lib/python3.9/site-packages/synapse/logging/opentracing.py", line 889, in _wrapper
2022-11-24T07:20:08.1960190Z synapse_main |     return await func(*args, **kwargs)  # type: ignore[misc]
2022-11-24T07:20:08.1960570Z synapse_main |   File "/usr/local/lib/python3.9/site-packages/synapse/handlers/federation_event.py", line 1072, in _compute_event_context_with_maybe_missing_prevs
2022-11-24T07:20:08.1960681Z synapse_main |     raise FederationError(
2022-11-24T07:20:08.1960965Z synapse_main | synapse.api.errors.FederationError: ERROR 403: We can't get valid state history.

The test in question uses the @t10alice user, which is Lazy-loading incremental sync includes remote memberships during partial state join.

We have to allow Synapse to finish ongoing partial joins by the end of every test, otherwise this class of flake will happen.

@DMRobertson
Copy link
Contributor Author

We have to allow Synapse to finish ongoing partial joins by the end of every test, otherwise this class of flake will happen.

Does that amount to calling psjResult.FinishStateRequest() and waiting for a sync to confirm the join was successful?

@squahtx
Copy link
Contributor

squahtx commented Dec 8, 2022

That sounds about right.
We do do the equivalent of psjResult.FinishStateRequest() in the Destroy cleanup method, but don't stick around to confirm Synapse is done. We're probably shutting down too fast after that.

There's an awaitPartialStateJoinCompletion method that works by waiting for /members to respond. /sync isn't reliable, since it'll respond once the join event has been un-partial stated and doesn't guarantee that any events after that have been un-partial stated.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A-Federated-Join joins over federation generally suck A-Testing Issues related to testing in complement, synapse, etc T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. Z-Dev-Wishlist Makes developers' lives better, but doesn't have direct user impact Z-Flake Tests that give intermittent failures
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants