Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Redis cluster doesn't reconnect when node returns online #37348

Closed
sfali16 opened this issue Nov 28, 2023 · 9 comments
Closed

Redis cluster doesn't reconnect when node returns online #37348

sfali16 opened this issue Nov 28, 2023 · 9 comments
Labels

Comments

@sfali16
Copy link

sfali16 commented Nov 28, 2023

Description

Creating issue per #37041 (comment)

There is an issue with reconnects - take the starter code from Issue 37041. If you shutdown the redis cluster, issue a request that connects to redis which fails, then restart the cluster, subsequent requests don't reconnect to redis.
Vert.x redis client purposely doesn't implement client reconnects - quarkus should probably do that.

Reproduction steps:

  1. Credit: @bartm-dvb Clone repo: https://github.com/bartm-dvb/quarkus-redis-bug
  1. Stop redis (docker compose down)
  2. access cat-fact again -> see the following error in the quarkus log:
    2023-11-17 22:27:36,551 ERROR [io.qua.ver.htt.run.QuarkusErrorHandler] (vert.x-eventloop-thread-1) HTTP Request to /cat-fact failed, error id: 736e1d37-7d84-43fb-8e37-2d4d34f4eda6-11: io.vertx.core.impl.NoStackTraceThrowable: Cannot connect to any of the provided endpoints"
  3. Start redis. Access cat-fact - see above error repeat over and over until quarkus is restarted. Note - when you stop redis, don't attempt to reconnect, restart redis, then the the error never occurs.

Implementation ideas

Solution:
Quarkus should automatically handle reconnecting to redis if the redis cluster has become unavailable during some requests. At a minimum, in this application, quarkus should (in the absence of cache) retrieve the cat fact by going out to the external service when the cache fails.

@sfali16 sfali16 added the kind/enhancement New feature or request label Nov 28, 2023
Copy link

quarkus-bot bot commented Nov 28, 2023

/cc @cescoffier (redis), @gsmet (redis), @machi1990 (redis)

@sfali16
Copy link
Author

sfali16 commented Nov 28, 2023

If I'm interpretting this RedisCacheImpl code correctly, then if a ConnectionException was thrown by Vertx redis client, then the RedisCache in quarkus should retrieve the non-cached value from 'the valueloader', so maybe quarkus is already supposed to do the second part of my suggested implementation, i.e. retrieve the cat fact by going out to the external service when the cache fails.

               .onFailure(ConnectException.class).recoverWithUni(e -> {
                    log.warn("Unable to connect to Redis, recomputing cached value", e);
                    return valueLoader.apply(key);
                });

@cescoffier
Copy link
Member

@Ladicek Looking at https://vertx.io/docs/vertx-redis-client/java/#_implementing_reconnect_on_error. When a failure happens, we must re-create the client (and the pool). If so, we would need a facade that handles that. I'm worried that one connection having an error requires completely recreating the client and pools (meaning the other connection may still be fine).

WDYT?

@Ladicek
Copy link
Contributor

Ladicek commented Nov 28, 2023

The example code in Vert.x Redis client documentation is pretty naive. There's a good reason for that -- failure detection is hard. However, I believe that the simple failure mode (Redis has gone) should be handled transparently by the connection pool. If a connection fails, it should be evicted from the pool, and a new one should be added, which should effectively implement reconnection. I need to check why it doesn't work like that.

EDIT: of course, what I'm suggesting leads to propagating errors to user code. I don't see an issue with that.

@Ladicek
Copy link
Contributor

Ladicek commented Nov 28, 2023

Heh, this is funny. It actually works exactly as I expect (in my previous comment) when using a standalone Redis connection. When I configure a cluster connection, it falls apart:

  1. The Redis-based cache doesn't catch the error and doesn't invoke the "cached" method, because the exception is not a ConnectException, but NoStackTraceThrowable: Cannot connect to any of the provided endpoints.
  2. If I bring Redis back, it doesn't reconnect, it keeps failing with NoStackTraceThrowable: Cannot connect to any of the provided endpoints.

@Ladicek
Copy link
Contributor

Ladicek commented Nov 28, 2023

The 1st issue mentioned above is easy to solve. There are 2 basic ways to do it, either on the Quarkus side (detect NoStackTraceThrowable and inspect the exception message), or on the Vert.x Redis client side (use a subclass of ConnectException). I believe solving it on the Redis client side is better, but we could do both. I don't think that's necessary, though, because of the other issue, which needs to be solved on the Vert.x Redis client side.

The 2nd issue took me a while to figure out. When the Redis client connects to a cluster, it first obtains the hash slot assignment. To prevent overloading the first node in the list, the hash slot assignment is cached for a brief period of time (1 second by default). However, we only set up a timer to expire that miniature cache when we obtain the hash slot assignment successfully. If the CLUSTER SLOTS operation fails, we never expire the cache, so all other attempts to connect to the Redis cluster fail with the cached error. I implemented this caching mechanism, and I don't know what I was thinking 🤦

I'll submit PRs to Vert.x Redis client in a bit.

@Ladicek
Copy link
Contributor

Ladicek commented Nov 28, 2023

I ended up amending my existing PRs that were not merged yet, because it's essentially the same area of improvements:

@cescoffier
Copy link
Member

Awesome! Thanks @Ladicek

@Ladicek
Copy link
Contributor

Ladicek commented Feb 22, 2024

The issues here were fixed in Vert.x Redis client 4.5.1. Quarkus updated to Vert.x 4.5.1 in 3.7.0 (see #38034), hence closing this.

@Ladicek Ladicek closed this as completed Feb 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants