
Quarkus + Stork/consul round-robin service discovery cache expiration over nodes that are down (Baremetal) #24343

Closed
pjgg opened this issue Mar 16, 2022 · 4 comments
Labels: area/stork, kind/bug

Comments

pjgg (Contributor) commented Mar 16, 2022

Describe the bug

QuarkusVersion: 2.7.4.Final
Reproducer: quarkus-qe/quarkus-test-suite#572
cmd: mvn clean verify -Dall-modules -pl service-discovery/stork-consul -Dit.test=StorkServiceDiscoveryIT#storkLoadBalancerServiceNodeDown

Even if Quarkus/Stork is not fault-tolerant and doesn't "detect" that a service node is down, there is a cache expiration property, stork.pong-replica.service-discovery.refresh-period, which in combination with a "retry" policy could do the job. However, if a node is down and the cache has already expired, the Stork load balancer keeps dispatching requests to that node.
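
For reference, a minimal sketch of the kind of configuration involved, assuming the Consul service discovery from the reproducer; apart from the refresh-period property quoted above, the property names and values are assumptions and should be checked against the reproducer. The "retry" policy itself would sit on the REST client side (e.g. SmallRye Fault Tolerance @Retry).

    # application.properties (sketch, not copied from the reproducer)
    stork.pong-replica.service-discovery=consul
    stork.pong-replica.service-discovery.consul-host=localhost
    stork.pong-replica.service-discovery.consul-port=8500
    # cache expiration for the discovered instances (the property discussed here)
    stork.pong-replica.service-discovery.refresh-period=5s
    # round-robin is the load-balancing strategy under test
    stork.pong-replica.load-balancer=round-robin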

Expected behavior

If a service node is down and the cache expiration time has been exceeded, I expected Quarkus/Stork to only add a service instance to the cache if the node is up and ready (maybe by calling /q/health/ready).

Actual behavior

A service node that is down is still listed by Stork as an available node, even after the Stork cache has expired.

How to Reproduce?

Reproducer: quarkus-qe/quarkus-test-suite#572

cmd: mvn clean verify -Dall-modules -pl service-discovery/stork-consul -Dit.test=StorkServiceDiscoveryIT#storkLoadBalancerServiceNodeDown

Output of uname -a or ver

No response

Output of java -version

No response

GraalVM version (if different from Java)

No response

Quarkus version or git rev

No response

Build tool (ie. output of mvnw --version or gradlew --version)

No response

Additional information

No response

pjgg added the kind/bug label Mar 16, 2022
quarkus-bot (bot) commented Mar 16, 2022

/cc @gwenneg, @michalszynkiewicz

pjgg changed the title from "Quarkus + Stork/consul round-robin load balancer cache expiration (Baremetal)" to "Quarkus + Stork/consul round-robin load balancer cache expiration over nodes that are down (Baremetal)" Mar 16, 2022
pjgg changed the title from "Quarkus + Stork/consul round-robin load balancer cache expiration over nodes that are down (Baremetal)" to "Quarkus + Stork/consul round-robin service discovery cache expiration over nodes that are down (Baremetal)" Mar 16, 2022
michalszynkiewicz (Member) commented

So you'd like Stork to remove a service instance from the ones the client tries if the instance is not available?

pjgg (Contributor, Author) commented Mar 17, 2022

I think it makes sense to check whether the service is ready before adding it to the pool of services, and to remove the instance from the pool if it is not ready. This way, we avoid unnecessary retries on the client side.

Current behavior:

Client(with retry policy) -> stork(service discovery + LB) -> service_one
                                                              service_one_replica (down)

In the end, all requests will succeed, but the price is too high (in terms of requests and load on the available node). Also, the service might be up, but on another k8s node, because it was moved (AWS spot instances or other ephemeral cloud nodes). In that case, the service on the new node is registered again, but the old instance also remains in the pool.

If it were possible to let Quarkus/Stork know, via configuration, where the readiness URL is located, then Quarkus/Stork could use it when a service is added or when stork.yourService.service-discovery.refresh-period expires, and avoid these unnecessary retries.
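
Purely as an illustration, a hypothetical configuration for this could look like the following (these properties do not exist in Stork; the names are invented just to show the idea):

    # HYPOTHETICAL properties - illustrating the proposal, not an existing Stork option
    stork.pong-replica.service-discovery.readiness-path=/q/health/ready
    # an instance failing this probe would be dropped from the pool when registered
    # and every time the refresh-period expires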

michalszynkiewicz (Member) commented

It is out of the scope of Stork to perform heartbeats for service instances.
Service discovery solutions, such as Consul, can do it, and Stork can ask Consul for healthy services only (there is a configuration property for it).
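
For example, something along these lines should make Stork discover only instances whose Consul health checks pass (a sketch; the property name here is assumed and should be checked against the Stork Consul service discovery documentation):

    # sketch: only discover instances reported healthy by Consul
    stork.pong-replica.service-discovery=consul
    # property name assumed; verify against the Stork Consul service discovery options
    stork.pong-replica.service-discovery.use-health-checks=true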

Also, it is possible for a load balancer to add an instance to some kind of block list after a failure. Of the existing load balancers, the least-response-time one is the closest to doing this, but it just treats failures as a very long response time, so that fewer requests are directed to such an instance.

Does this answer your doubts?

pjgg closed this as completed Mar 17, 2022