Redis cluster doesn't reconnect when node returns online #37348

sfali16 · 2023-11-28T08:16:13Z

Description

There is an issue with reconnects. Take the starter code from issue 37041. If you shut down the Redis cluster, issue a request that tries to connect to Redis (and fails), and then restart the cluster, subsequent requests don't reconnect to Redis.
The Vert.x Redis client purposely doesn't implement client reconnects, so Quarkus should probably do that.

Reproduction steps:

Credit: @bartm-dvb

1. Clone the repo: https://github.com/bartm-dvb/quarkus-redis-bug
2. Start Redis in cluster mode with docker-compose up
3. Start Quarkus: ./mvnw quarkus:dev
4. Access http://localhost:8080/cat-fact
5. Stop Redis (docker compose down)
6. Access cat-fact again and see the following error in the Quarkus log:
2023-11-17 22:27:36,551 ERROR [io.qua.ver.htt.run.QuarkusErrorHandler] (vert.x-eventloop-thread-1) HTTP Request to /cat-fact failed, error id: 736e1d37-7d84-43fb-8e37-2d4d34f4eda6-11: io.vertx.core.impl.NoStackTraceThrowable: Cannot connect to any of the provided endpoints
7. Start Redis and access cat-fact again. The error above repeats on every request until Quarkus is restarted.

Note: if you stop Redis and then restart it without making any request in between, the error never occurs.

Implementation ideas

Solution:
Quarkus should automatically handle reconnecting to redis if the redis cluster has become unavailable during some requests. At a minimum, in this application, quarkus should (in the absence of cache) retrieve the cat fact by going out to the external service when the cache fails.


quarkus-bot · 2023-11-28T08:16:17Z

/cc @cescoffier (redis), @gsmet (redis), @machi1990 (redis)

sfali16 · 2023-11-28T08:36:14Z

If I'm interpreting this RedisCacheImpl code correctly, then when the Vert.x Redis client throws a ConnectException, the RedisCache in Quarkus should retrieve the non-cached value from the value loader. So maybe Quarkus is already supposed to do the second part of my suggested implementation, i.e. retrieve the cat fact by going out to the external service when the cache fails.

quarkus/extensions/redis-cache/runtime/src/main/java/io/quarkus/cache/redis/runtime/RedisCacheImpl.java

Line 235 in 670b43c

                .onFailure(ConnectException.class).recoverWithUni(e -> {
                    log.warn("Unable to connect to Redis, recomputing cached value", e);
                    return valueLoader.apply(key);
                });
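The fallback in the snippet above can be illustrated without Mutiny. The following is only a minimal sketch that uses java.util.concurrent.CompletableFuture as a stand-in for Uni. The class CacheFallbackSketch, the method getOrLoad, and the cacheLookup parameter are hypothetical names and are not part of RedisCacheImpl.

```java
import java.net.ConnectException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.function.Function;

// Hypothetical illustration; CompletableFuture stands in for Mutiny's Uni.
class CacheFallbackSketch {

    // Looks the key up in the cache; if the lookup fails with a
    // ConnectException, recomputes the value with the loader instead.
    static <K, V> CompletableFuture<V> getOrLoad(
            K key,
            Function<K, CompletableFuture<V>> cacheLookup,
            Function<K, V> valueLoader) {
        return cacheLookup.apply(key).exceptionally(failure -> {
            Throwable cause = unwrap(failure);
            if (cause instanceof ConnectException) {
                return valueLoader.apply(key);
            }
            // Any other failure is propagated to the caller.
            throw new CompletionException(cause);
        });
    }

    // CompletableFuture may wrap the original failure in a CompletionException.
    private static Throwable unwrap(Throwable t) {
        return (t instanceof CompletionException && t.getCause() != null) ? t.getCause() : t;
    }

    public static void main(String[] args) {
        // Simulates a cache lookup that fails because Redis is unreachable.
        CompletableFuture<String> unreachable =
                CompletableFuture.failedFuture(new ConnectException("connection refused"));
        String fact = getOrLoad("cat-fact", k -> unreachable, k -> "loaded " + k).join();
        System.out.println(fact);
    }
}
```

Note that the recovery only triggers for ConnectException. A failure of any other type reaches the caller, which matters for the cluster case discussed later in this thread.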

cescoffier · 2023-11-28T08:54:32Z

@Ladicek Looking at https://vertx.io/docs/vertx-redis-client/java/#_implementing_reconnect_on_error, when a failure happens, we must re-create the client (and the pool). If so, we would need a facade that handles that. I'm worried that an error on one connection requires completely recreating the client and pools, even though the other connections may still be fine.

WDYT?

Ladicek · 2023-11-28T09:09:10Z

The example code in Vert.x Redis client documentation is pretty naive. There's a good reason for that -- failure detection is hard. However, I believe that the simple failure mode (Redis has gone) should be handled transparently by the connection pool. If a connection fails, it should be evicted from the pool, and a new one should be added, which should effectively implement reconnection. I need to check why it doesn't work like that.

EDIT: of course, what I'm suggesting leads to propagating errors to user code. I don't see an issue with that.

Ladicek · 2023-11-28T09:33:24Z

Heh, this is funny. It actually works exactly as I expect (in my previous comment) when using a standalone Redis connection. When I configure a cluster connection, it falls apart:

The Redis-based cache doesn't catch the error and doesn't invoke the "cached" method, because the exception is not a ConnectException, but NoStackTraceThrowable: Cannot connect to any of the provided endpoints.
If I bring Redis back, it doesn't reconnect, it keeps failing with NoStackTraceThrowable: Cannot connect to any of the provided endpoints.

Ladicek · 2023-11-28T12:05:09Z

The 1st issue mentioned above is easy to solve. There are 2 basic ways to do it, either on the Quarkus side (detect NoStackTraceThrowable and inspect the exception message), or on the Vert.x Redis client side (use a subclass of ConnectException). I believe solving it on the Redis client side is better, but we could do both. I don't think that's necessary, though, because of the other issue, which needs to be solved on the Vert.x Redis client side.
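As a rough illustration of the Quarkus-side option, the check could look something like the sketch below. It is plain Java with no Vert.x dependency, so it only inspects the message rather than testing for NoStackTraceThrowable itself. The class ConnectFailureCheck and the method isConnectFailure are hypothetical names. The message text is the one reported in this issue, and matching on it is an assumption that would break if the client changed its wording.

```java
import java.net.ConnectException;

// Hypothetical helper; not part of Quarkus or the Vert.x Redis client.
class ConnectFailureCheck {

    // Message text as reported in this issue for the cluster case.
    static final String CLUSTER_CONNECT_MESSAGE = "Cannot connect to any of the provided endpoints";

    // Returns true when the failure, or one of its causes, looks like
    // "Redis is unreachable". The depth limit guards against cyclic cause chains.
    static boolean isConnectFailure(Throwable failure) {
        Throwable t = failure;
        for (int depth = 0; t != null && depth < 16; depth++) {
            if (t instanceof ConnectException) {
                return true;
            }
            String message = t.getMessage();
            if (message != null && message.contains(CLUSTER_CONNECT_MESSAGE)) {
                return true;
            }
            t = t.getCause();
        }
        return false;
    }
}
```

Matching on message text is brittle, which is one reason a dedicated ConnectException subclass on the client side is the cleaner of the two options.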

The 2nd issue took me a while to figure out. When the Redis client connects to a cluster, it first obtains the hash slot assignment. To prevent overloading the first node in the list, the hash slot assignment is cached for a brief period of time (1 second by default). However, we only set up a timer to expire that miniature cache when we obtain the hash slot assignment successfully. If the CLUSTER SLOTS operation fails, we never expire the cache, so all other attempts to connect to the Redis cluster fail with the cached error. I implemented this caching mechanism, and I don't know what I was thinking 🤦
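The failure mode described above can be sketched in a few lines. This is not the Vert.x Redis client's code, and it is not necessarily how the actual fix works. It is a hypothetical, simplified cache (ExpiringAsyncCache, with a loader standing in for the CLUSTER SLOTS call) that shows one way to avoid caching a failure forever: drop a failed lookup immediately, and expire a successful one after the TTL.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Hypothetical sketch; names and structure are assumptions for illustration.
class ExpiringAsyncCache<T> {
    private final AtomicReference<CompletableFuture<T>> cached = new AtomicReference<>();
    private final Supplier<CompletableFuture<T>> loader; // stand-in for issuing CLUSTER SLOTS
    private final long ttlMillis;

    ExpiringAsyncCache(Supplier<CompletableFuture<T>> loader, long ttlMillis) {
        this.loader = loader;
        this.ttlMillis = ttlMillis;
    }

    CompletableFuture<T> get() {
        CompletableFuture<T> fresh = new CompletableFuture<>();
        CompletableFuture<T> existing = cached.compareAndExchange(null, fresh);
        if (existing != null) {
            return existing; // reuse the in-flight or recently completed lookup
        }
        CompletableFuture<T> lookup;
        try {
            lookup = loader.get();
        } catch (RuntimeException e) {
            lookup = CompletableFuture.failedFuture(e);
        }
        lookup.whenComplete((value, failure) -> {
            if (failure != null) {
                // Drop the failed lookup right away so the next caller retries.
                // Without this branch, the failure would stay cached forever.
                cached.compareAndSet(fresh, null);
                fresh.completeExceptionally(failure);
            } else {
                fresh.complete(value);
                // Expire the successful lookup once the TTL has elapsed.
                CompletableFuture.delayedExecutor(ttlMillis, TimeUnit.MILLISECONDS)
                        .execute(() -> cached.compareAndSet(fresh, null));
            }
        });
        return fresh;
    }
}
```

With this shape, a lookup that fails while Redis is down affects only the callers that were waiting on it. The first call after Redis comes back triggers a new lookup.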

I'll submit PRs to Vert.x Redis client in a bit.

Ladicek · 2023-11-28T16:22:32Z

I ended up amending my existing PRs that were not merged yet, because it's essentially the same area of improvements:

cescoffier · 2023-11-30T09:00:01Z

Awesome! Thanks @Ladicek

Ladicek · 2024-02-22T08:44:49Z

The issues here were fixed in Vert.x Redis client 4.5.1. Quarkus updated to Vert.x 4.5.1 in 3.7.0 (see #38034), hence closing this.

sfali16 added the kind/enhancement New feature or request label Nov 28, 2023

quarkus-bot bot added the area/redis label Nov 28, 2023

Ladicek mentioned this issue Nov 28, 2023

Redis cluster improvements vert-x3/vertx-redis-client#419

Merged

cescoffier added area/vertx triage/upstream labels Nov 30, 2023

Ladicek closed this as completed Feb 22, 2024
