Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Tests for IPv6 bugs / redis replication over IPv6 #11508

Closed
wants to merge 0 commits into from

Conversation

evilham
Copy link
Contributor

@evilham evilham commented Dec 5, 2021

This will help close some long-standing issues:
("X" means we are adding a test for it)
[ ] AS #4092
[ ] metrics EP #6644
[X] replication #7695
[X] smtp #7720
[X] redis #10694

In order to run these IPv6-related tests, the test suite must run with:
env SYNAPSE_POSTGRES=YES DEBUG_IPv6=YES trial tests

Sponsored by: ungleich.ch
Signed-off-by: @telmich Nico Schottelius [email protected]

@evilham evilham requested a review from a team as a code owner December 5, 2021 11:45
Copy link
Member

@richvdh richvdh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I appreciate the effort, but I don't really see the value of adding tests which are disabled by default. They'll just bitrot until they get removed again (see https://github.com/matrix-org/synapse/pull/11495/files#diff-cda5424a49ddae4fe5181900d1b607a9daf5c9afa539ed1fdb01622de069e499L198 for a recent example).

@evilham
Copy link
Contributor Author

evilham commented Dec 6, 2021

Hey @richvdh, thanks for taking a look.

Maybe I should have elaborated a tad more:
the motivation with @telmich to start with these tests is that they provide a very simple path to check that these known bugs are solved without breaking Synapse's pipeline, and a simpler way to fix the issue with replication workers (the main motivation for this), since that is indeed more involved than it looks like at first.
Note that the tests are not really new, they re-use the code from existing tests, just with different fake DNS entries and changes in tests/server.py mean that this whole class of issues will be caught in the future, so that is also a win.

Indeed, I'd recommend that @xfk abandons and re-opens #10717 (comment) with just the fix for the SMTP bug (#7720) and enabling the corresponding test by default, that way we both check that the bug is fixed and that regressions do not happen.
If they do not in a reasonable time, since the original fix was somewhat simple and already signed-off, I'd see that this happens.

Also based on your comment on #10717 (comment) I started re-working the replication bits, which is the most complex of these bugs, and will need to ask a couple questions to finish that.

@richvdh
Copy link
Member

richvdh commented Dec 6, 2021

Yes, having the tests is very useful, and your efforts in contributing them are very much appreciated. It's just that from a practical point of view I'd much rather we combine the fix and the tests into the same PR - so for example let's have a PR which fixes #7720 and lands the new test.

Perhaps we can mark this PR as a draft for now so that people can use it as inspiration for writing those PRs?

@evilham evilham marked this pull request as draft December 6, 2021 11:05
@evilham
Copy link
Contributor Author

evilham commented Dec 7, 2021

OK, let's redirect this PR into fixing #10694 (and #7695 as a particular case), if in the meantime @xfk wants to open a PR with the fix for #7720 and the corresponding (passing and always enabled) test from this PR, that'd be great.

Based on previous discussion and particularly this comment #10717 (comment), I've been looking into refactoring the communication with Redis using a ClientService, which proves more challenging than I first expected.

Here are some things I've noticed, please correct if any of these is wrong since understanding this right is quite important:

  • Redis is only ever used through synapse.server.HomeServer.get_outbound_redis_connection and RedisDirectTcpReplicationClientFactory
  • The outbound_redis_connection is used in the synapse.replication.tcp.external_cache.ExternalCache which is using the RedisProtocol directly.
  • This ExternalCache is used in synapse.federation.sender.FederationSender in a subscription, read-only fashion, but it is used in a write-fashion by synapse.handlers/message.EventCreationHandler. This is a tad unexpected and confusing.
  • It is also used by the synapse.replication.tcp.handler.ReplicationCommandHandler, passing it to synapse.replication.tcp.redis.RedisDirectTcpReplicationClientFactory.
  • RedisDirectTcpReplicationClientFactory creates its own connection to Redis and it is used in tandem with the outbound_redis_connection, it uses the outbound_redis_connection in a write-fashion to send commands and the associated connection is used in a subscription, read-only fashion.

Now, the doubts:

  • synapse.replication.tcp.redis.lazyConnection (used for the outbound_redis_connection) claims to return a RedisProtocol, but it realy returns a txredisapi.ConnectionHandler (see:
    return factory.handler
    ). Which.... yes and no, it wraps the behaviour of a RedisProtocol, but it is doing some magic regarding connection pools. I don't think this should happen.
  • There is a periodic ping that claims to keep the connection alive, I wonder if this would be needed if we switch to ClientService, how exactly did txredisapi "stop"? From my understanding of Twisted, the instance of the RedisProtocol should be treated as something that is bound to change, so if the connection is lost, you should get another one, with the proper factories and all that, it should be ready to use. Maybe the fact that txredisapi is exposing weird objects like the ConnectionHandler lead to treating it as something semi-persistent and to these issues?

My overall plan to make this happen is:

  • Stay away from txredisapi.ConnectionHandler and txredisapi.RedisFactory, they are trying to do way too much and are inflexible.
  • Create our own SimpleRedisFactory (which might be upstreamed to txredisapi in the future, but in the meantime should live in synapse as well due to packaging) which does none of the reconnection and pool magic.
  • Use this SimpleRedisFactory with a RedisClientService to have free connections with exponential backoff and reset of the timers.
  • Never pass the RedisProtocol from synapse.server.HomeServer (or ever), but instead the RedisClientService which should have ways of letting users of the service know when the connection is lost / recovered.
  • RedisDirectTcpReplicationClientFactory should use two such RedisClientService objects, the one passed by the HomeServer and its own connection service.

Does this sound OK? am I missing something?

@evilham evilham changed the title Add optional tests documenting misbehaviour in IPv6-only scenarios Tests for IPv6 bugs / redis replication over IPv6 Dec 7, 2021
@richvdh
Copy link
Member

richvdh commented Dec 7, 2021

  • Redis is only ever used through synapse.server.HomeServer.get_outbound_redis_connection and RedisDirectTcpReplicationClientFactory
  • The outbound_redis_connection is used in the synapse.replication.tcp.external_cache.ExternalCache which is using the RedisProtocol directly.
  • This ExternalCache is used in synapse.federation.sender.FederationSender in a subscription, read-only fashion, but it is used in a write-fashion by synapse.handlers/message.EventCreationHandler. This is a tad unexpected and confusing.
  • It [outbound_redis_connection] is also used by the synapse.replication.tcp.handler.ReplicationCommandHandler, passing it to synapse.replication.tcp.redis.RedisDirectTcpReplicationClientFactory.
  • RedisDirectTcpReplicationClientFactory creates its own connection to Redis and it is used in tandem with the outbound_redis_connection, it uses the outbound_redis_connection in a write-fashion to send commands and the associated connection is used in a subscription, read-only fashion.

This is all correct.

This [use of ExternalCache] is a tad unexpected and confusing.

I don't think you need to worry too much about how ExternalCache works: for our purposes it's just a class which needs access to a redis connection where it can send commands and receive the replies. But I'm not sure which part is unexpected: the fact that EventCreationHandler writes to the cache? What would be the use of a cache which is never populated?

Now, the doubts:

  • synapse.replication.tcp.redis.lazyConnection (used for the outbound_redis_connection) claims to return a RedisProtocol, but it realy returns a txredisapi.ConnectionHandler. Which.... yes and no, it wraps the behaviour of a RedisProtocol, but it is doing some magic regarding connection pools. I don't think this should happen.

Maybe. It would certainly be nice if there were a type definition for the interface exposed by RedisProtocol, so that lazyConnection can say "here is a thing which behaves the same as a RedisProtocol, but isn't one". But txredisapi doesn't define such a class.

  • There is a periodic ping that claims to keep the connection alive,

Did you link to the right line here? That's nothing to do with pings, but rather about whether txredisapi.RedisFactory reconnects after a connection failure. But I think it is specific to RedisFactory and not required for ClientService.

My overall plan to make this happen is:

  • Stay away from txredisapi.ConnectionHandler and txredisapi.RedisFactory, they are trying to do way too much and are inflexible.

+1

  • Create our own SimpleRedisFactory (which might be upstreamed to txredisapi in the future, but in the meantime should live in synapse as well due to packaging) which does none of the reconnection and pool magic.

You may find that a factory created with Factory.forProtocol is sufficent here.

  • Use this SimpleRedisFactory with a RedisClientService to have free connections with exponential backoff and reset of the timers.

Are you intending to make RedisClientService a derived class of ClientService? I would avoid doing so - either wrap the ClientService inside another class, or just pass the ClientService around.

  • Never pass the RedisProtocol from synapse.server.HomeServer (or ever), but instead the RedisClientService which should have ways of letting users of the service know when the connection is lost / recovered.

I'm not sure if you actually need to know that? Could you just call ClientService.whenConnected to get the current connection?

  • RedisDirectTcpReplicationClientFactory should use two such RedisClientService objects, the one passed by the HomeServer and its own connection service.

Does this sound OK? am I missing something?

I think it sounds about right. To be honest, I'm struggling to figure out how it all fits together when looking at it in the abstract.

@richvdh
Copy link
Member

richvdh commented Apr 21, 2022

@evilham: are you still interested in working on this?

@richvdh richvdh added the X-Awaiting-Changes A contributed PR which needs changes and re-review before it can be merged label Apr 21, 2022
@evilham
Copy link
Contributor Author

evilham commented Apr 21, 2022

Yep, I want to get this fixed and it's on my TODO, but things have gotten in the way >,<. Will try to re-prioritise, thanks for the ping!

@clokep
Copy link
Member

clokep commented Jan 25, 2023

This has a bunch of conflicts that need to be solved; I wonder if a more reasonable way to handle this would be to:

  • Close this PR;
  • Take the tests for each item you've listed above (e.g. SMTP) and open a new PR with just those tests;
  • Fix support for whatever that item tests in the same PR;
  • Review it!

Hopefully this is pretty doable and folks can crib off this.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
X-Awaiting-Changes A contributed PR which needs changes and re-review before it can be merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants