This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

blacklist servers which are down for several days #5113

Closed
richvdh opened this issue Apr 29, 2019 · 18 comments
Labels
z-outbound-federation-meltdown (Deprecated Label) Synapse melting down by trying to talk to too many servers z-p2 (Deprecated Label)

Comments

@richvdh
Member

richvdh commented Apr 29, 2019

Why don't we add servers which fail to respond to federation requests for several days to a blacklist, and stop trying?

This could significantly reduce network traffic, CPU usage, and the amount of cruft that gets logged.

We would need to unblock such servers when we receive a valid request from them.

@aaronraimist
Contributor

aaronraimist commented Apr 29, 2019

See https://gist.github.com/Sharparam/b144e294189d78ee6c73df0e109ee2af for one week of data showing the number of times per day that @Sharparam's server contacted a bunch of dead servers.

See convo starting from https://matrix.to/#/!HsxjoYRFsDtWBgDQPh:matrix.org/$155655145216EFGaa:matrix.sharparam.com?via=matrix.org&via=chat.weho.st&via=hackerspaces.be

@richvdh
Member Author

richvdh commented Apr 29, 2019

Note that we already have an exponential-backoff algorithm, but it tops out at a 24hr retry period.

After the first failed request, we back off for 10 minutes. We then increase the backoff by a factor of between 4 and 7 after each failed request, until we get to 24 hours. The retry intervals are therefore:

  • 10min
  • ~ 55min
  • ~ 5hr
  • 24hr
  • 24hr
  • ...

An easy way to implement this would be to go to 999 years after 24 hours.
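As a rough illustration of the schedule described above, here is a minimal sketch. It is not Synapse's actual code: the function names, the `multiplier` override and the `give_up_after_cap` flag are made up for this example. Only the 10-minute start, the 4x to 7x growth factor, the 24-hour cap and the "999 years" idea come from the comment.

```python
import random

MIN_RETRY_INTERVAL_MIN = 10                  # first backoff (from the comment)
MAX_RETRY_INTERVAL_MIN = 24 * 60             # 24-hour cap (from the comment)
GIVE_UP_INTERVAL_MIN = 999 * 365 * 24 * 60   # "999 years", i.e. effectively never


def next_retry_interval(current_min, multiplier=None, give_up_after_cap=False):
    """Return the next per-host retry interval, in minutes.

    `multiplier` is a hypothetical override so the schedule can be made
    deterministic; by default a factor between 4 and 7 is drawn at random.
    """
    if current_min <= 0:
        # First failure: start at the minimum interval.
        return MIN_RETRY_INTERVAL_MIN
    if give_up_after_cap and current_min >= MAX_RETRY_INTERVAL_MIN:
        # Proposed behaviour: once the cap has been hit, stop retrying.
        return GIVE_UP_INTERVAL_MIN
    if multiplier is None:
        multiplier = random.uniform(4, 7)
    return min(current_min * multiplier, MAX_RETRY_INTERVAL_MIN)


def schedule(failures, multiplier=None, give_up_after_cap=False):
    """List the retry intervals after each of `failures` consecutive failures."""
    intervals = []
    current = 0
    for _ in range(failures):
        current = next_retry_interval(current, multiplier, give_up_after_cap)
        intervals.append(current)
    return intervals
```

With the midpoint factor of 5.5, `schedule(4, multiplier=5.5)` gives 10, 55, 302.5 and 1440 minutes, which matches the 10min, ~55min, ~5hr, 24hr list above.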

@neilisfragile neilisfragile added enhancement z-p2 (Deprecated Label) labels Apr 29, 2019
@richvdh
Member Author

richvdh commented Apr 29, 2019

I should also mention that each request is retried several times (normally 10), with its own exponential-backoff loop. The per-host exponential backoff is only increased after a request fails completely.
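To make the two levels concrete, here is a hedged sketch of how a per-request retry loop can sit inside the per-host accounting. All names (`send_with_retries`, `attempt_request`, the `host_state` dict) and the delay values are invented for illustration; only the "about 10 attempts per request" and "per-host backoff grows only after complete failure" behaviour comes from the comment.

```python
import time


def send_with_retries(send, max_attempts=10, base_delay=0.5, sleep=time.sleep):
    """Try one request up to `max_attempts` times with its own short backoff.

    Returns True as soon as an attempt succeeds, False if all of them fail.
    `send` is any callable that raises OSError (e.g. ConnectionError) on failure.
    """
    delay = base_delay
    for attempt in range(max_attempts):
        try:
            send()
            return True
        except OSError:
            if attempt < max_attempts - 1:
                sleep(delay)
                delay *= 2  # inner, per-request exponential backoff
    return False


def attempt_request(host_state, send, **kwargs):
    """Run one request; bump the per-host failure count only on total failure."""
    ok = send_with_retries(send, **kwargs)
    if not ok:
        # Only now does the outer, per-host backoff get a step longer.
        host_state["failed_requests"] = host_state.get("failed_requests", 0) + 1
    return ok
```

A request that fails three times and then succeeds leaves `failed_requests` untouched; a request that fails all ten attempts increments it by one.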

@mguentner

mguentner commented Apr 29, 2019

The worst-case backoff time should be calculated, so that homeserver
operators know how long they have to fix issues. This default needs to be
communicated to homeserver admins.

Also, a server being down for a week or so shouldn't result in blacklisting unless the blacklisting is easily reversible.
One way to reverse it could be for the blacklisted server to contact the server that blacklisted it.

An observation:
I moved a homeserver I operate to a dedicated machine some days ago.
Before, it was running on domain.tld; now it runs on synapse.domain.tld.

To do that, I created a .well-known/matrix/server file and changed the
SRV record from domain.tld to synapse.domain.tld.
However, the main domain (without the host part, synapse) now appears in the gist referenced in #5113 (comment).
My homeserver happily federates with the network as far as I can see, and the fed tester
is also green.
Adding a .well-known/matrix/server file, or changing or adding the SRV record, mustn't
result in a ban, even if cached DNS results or GET requests for .well-known/matrix/server
cause requests to fail.

@Sharparam

@mguentner What is your server hostname so I can check the logs more closely for errors related to it?

@kroeckx

kroeckx commented May 1, 2019

@Sharparam: My server (roeckx.be) seems to have high numbers in that file, but the server has always been up.

@Sharparam

Sharparam commented May 1, 2019

@kroeckx Your server is responding with 400: Bad Request to federation attempts.

Edit: The latest entry from the dumped logs:

2019-04-29 15:32:44,450 - synapse.http.matrixfederationclient - 472 - WARNING - federation_transaction_transmission_loop-242514- {PUT-O-173776} [roeckx.be] Request failed: PUT matrix://roeckx.be/_matrix/federation/v1/send/1556427433382: HttpResponseException("400: b'Bad Request'",)

@kroeckx

kroeckx commented May 1, 2019 via email

@Sharparam

Sharparam commented May 1, 2019

For the record, some discussion related to this in #synapse:matrix.org starting from this message.

Edit: Looks like you might need to update your Synapse instance @kroeckx. You are on 0.99.2 and the latest Synapse is 0.99.3, apparently something regarding federation was changed.

@mguentner

@Sharparam Aha! My instance was still on 0.99.2 as well but now runs on 0.99.3
It would have been a bad reason to ban my homeserver though ;)

@Sharparam

Sharparam commented May 2, 2019

@mguentner Yeah, my analysis only looks at the errors logged by Synapse, and it doesn't account for the special case where a failed call is immediately followed by a successful one. The blacklisting code would have to take this into account somehow. (This might actually resolve itself if servers are removed from the blacklist as soon as a successful call is made or received.)

@richvdh
Member Author

richvdh commented May 2, 2019

> The blacklisting code would have to take this into account somehow.

the current backoff code ignores 400s, ftr.

@Sharparam

I'm not sure if that's a solution though, there could be 400 errors that are not resolved by sending again with a slash.

@aaronraimist
Contributor

aaronraimist commented Jun 7, 2019

Seems like the things that cause a backoff have not been updated in several years https://github.com/matrix-org/synapse/blob/develop/synapse/util/retryutils.py#L177

I just pulled my last 4 hours of logs from 1.0.0rc1:

33% of the WARNING lines are RequestTransmissionFailed:[Error([('SSL routines', 'CONNECT_CR_CERT', 'certificate verify failed')],)]
16% TimeoutError('',)
15% DNSLookupError('no results for hostname lookup...
12% ConnectionRefusedError('Connection refused',)
10% Some form of DNSMismatch
7% HttpResponseException("400: b'Bad Request'",)

and a few ConnectError('Address not available',), HttpResponseException("403: b'Forbidden'",) and HttpResponseException("405: b'Method Not Allowed'",)

It doesn't seem like these cause a backoff, even though I believe most of them should.
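For anyone wanting to produce a similar breakdown from their own logs, a minimal sketch follows. It assumes WARNING lines shaped like the `Request failed: ... : ExceptionName(...)` line quoted earlier in this thread; the regex and the function names are my own, and other log shapes (such as the RequestTransmissionFailed lines) may need a different pattern.

```python
import re
from collections import Counter

# Assumed shape: "... Request failed: <METHOD> <url>: <ExceptionName>(...)"
FAILURE_RE = re.compile(r"Request failed: .*?: (\w+)\(")


def tally(lines):
    """Count failed federation requests by exception class name."""
    counts = Counter()
    for line in lines:
        match = FAILURE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def percentages(counts):
    """Convert counts to whole-number percentages of all matched lines."""
    total = sum(counts.values())
    return {name: round(100 * n / total) for name, n in counts.items()} if total else {}


# The one real line below is copied from the earlier comment in this thread.
SAMPLE = [
    '''2019-04-29 15:32:44,450 - synapse.http.matrixfederationclient - 472 - WARNING - federation_transaction_transmission_loop-242514- {PUT-O-173776} [roeckx.be] Request failed: PUT matrix://roeckx.be/_matrix/federation/v1/send/1556427433382: HttpResponseException("400: b'Bad Request'",)''',
    "a line that does not describe a failed request",
]

if __name__ == "__main__":
    print(percentages(tally(SAMPLE)))
```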

@richvdh
Member Author

richvdh commented Jun 9, 2019

@aaronraimist: what makes you think that those do not cause backoff? As far as I know they all should (most of them do not derive from CodeMessageException, so the line you have linked to is not relevant). In any case it's a separate problem, so please open a new issue.

@hawkowl hawkowl self-assigned this Jul 11, 2019
@hawkowl hawkowl added the z-outbound-federation-meltdown (Deprecated Label) Synapse melting down by trying to talk to too many servers label Jul 11, 2019
richvdh added a commit that referenced this issue Sep 12, 2019
Essentially the intention here is to end up blacklisting servers which never
respond to federation requests.

Fixes #5113.
@richvdh
Member Author

richvdh commented Sep 12, 2019

#6026

@DMRobertson
Contributor

> We would need to unblock such servers when we receive a valid request from them.

Did this happen?

@richvdh
Member Author

richvdh commented Nov 22, 2022

yes, valid requests received will reset the backoff.
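A minimal sketch of that reset behaviour, for readers following along. This is not the Synapse implementation: `RetryTable`, `DestinationState` and their method names are hypothetical, and timestamps are plain seconds passed in by the caller.

```python
from dataclasses import dataclass


@dataclass
class DestinationState:
    retry_interval_s: float  # current backoff interval for this host
    retry_after_ts: float    # do not contact the host before this time


class RetryTable:
    """Tracks which remote hosts we are currently backing off from."""

    def __init__(self):
        self._states = {}

    def record_failure(self, host, now, interval_s):
        # An outbound request failed completely: back off for interval_s.
        self._states[host] = DestinationState(interval_s, now + interval_s)

    def record_valid_inbound(self, host):
        # The host sent us a valid request, so it is evidently alive:
        # forget any backoff we were holding against it.
        self._states.pop(host, None)

    def should_skip(self, host, now):
        state = self._states.get(host)
        return state is not None and now < state.retry_after_ts
```

Even if the interval recorded for a host is effectively infinite, a single valid inbound request clears the entry, so a server that comes back is not locked out.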


8 participants