
Quarkus + Stork/consul round-robin service discovery cache expiration over nodes that are down (Baremetal) #24343

Closed
pjgg opened this issue Mar 16, 2022 · 4 comments
Labels: area/stork, kind/bug

Comments

pjgg (Contributor) commented Mar 16, 2022

Describe the bug

QuarkusVersion: 2.7.4.Final
Reproducer: quarkus-qe/quarkus-test-suite#572
cmd: mvn clean verify -Dall-modules -pl service-discovery/stork-consul -Dit.test=StorkServiceDiscoveryIT#storkLoadBalancerServiceNodeDown

Even if Quarkus/Stork is not fault-tolerant and doesn't "detect" that a service node is down, there is a cache expiration property, stork.pong-replica.service-discovery.refresh-period, which in combination with a "retry" policy could do the job. However, if a node is down and the cache has already expired, the Stork load balancer keeps dispatching requests to that node.
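
For reference, a minimal sketch of the kind of configuration involved, assuming the Consul service discovery from the reproducer; apart from the refresh-period property quoted above, the property names and values are assumptions and should be checked against the reproducer. The "retry" policy itself would sit on the REST client side (e.g. SmallRye Fault Tolerance @Retry).

    # application.properties (sketch, not copied from the reproducer)
    stork.pong-replica.service-discovery=consul
    stork.pong-replica.service-discovery.consul-host=localhost
    stork.pong-replica.service-discovery.consul-port=8500
    # cache expiration for the discovered instances (the property discussed here)
    stork.pong-replica.service-discovery.refresh-period=5s
    # round-robin is the load-balancing strategy under test
    stork.pong-replica.load-balancer=round-robin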

Expected behavior

If a service node is down and the cache expiration time has been exceeded, I expected Quarkus/Stork to only add a service instance to the cache if the node is up and ready (maybe by calling /q/health/ready).

Actual behavior

A service node that is down is still listed by Stork as an available node, even after the Stork cache has expired.

How to Reproduce?

Reproducer: quarkus-qe/quarkus-test-suite#572

cmd: mvn clean verify -Dall-modules -pl service-discovery/stork-consul -Dit.test=StorkServiceDiscoveryIT#storkLoadBalancerServiceNodeDown

Output of uname -a or ver

No response

Output of java -version

No response

GraalVM version (if different from Java)

No response

Quarkus version or git rev

No response

Build tool (ie. output of mvnw --version or gradlew --version)

No response

Additional information

No response

pjgg added the kind/bug label Mar 16, 2022
quarkus-bot (bot) commented Mar 16, 2022

/cc @gwenneg, @michalszynkiewicz

pjgg changed the title from "Quarkus + Stork/consul round-robin load balancer cache expiration (Baremetal)" to "Quarkus + Stork/consul round-robin load balancer cache expiration over nodes that are down (Baremetal)" Mar 16, 2022
pjgg changed the title from "Quarkus + Stork/consul round-robin load balancer cache expiration over nodes that are down (Baremetal)" to "Quarkus + Stork/consul round-robin service discovery cache expiration over nodes that are down (Baremetal)" Mar 16, 2022
michalszynkiewicz (Member) commented

So you'd like Stork to remove a service instance from the ones the client tries if the instance is not available?

pjgg (Contributor, Author) commented Mar 17, 2022

I think it makes sense to check whether the service is ready before adding it to the pool of services, and to remove the instance from the pool if it is not ready. This way, we avoid unnecessary retries on the client side.

Current behavior:

Client(with retry policy) -> stork(service discovery + LB) -> service_one
                                                              service_one_replica (down)

In the end, all requests will succeed, but the price is too high (in terms of requests and load on the available node). Also, the service might be up, but on another k8s node, because it was moved (AWS spot instances or other ephemeral cloud nodes). In that case, the service on the new node is registered again, but the old instance also remains in the pool.

If it were possible to let Quarkus/Stork know, via configuration, where the readiness URL is located, then Quarkus/Stork could use it when a service is added or when stork.yourService.service-discovery.refresh-period expires, and avoid these unnecessary retries.
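
Purely as an illustration, a hypothetical configuration for this could look like the following (these properties do not exist in Stork; the names are invented just to show the idea):

    # HYPOTHETICAL properties - illustrating the proposal, not an existing Stork option
    stork.pong-replica.service-discovery.readiness-path=/q/health/ready
    # an instance failing this probe would be dropped from the pool when registered
    # and every time the refresh-period expires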

michalszynkiewicz (Member) commented

It is out of the scope of Stork to perform heartbeats for service instances.
Service discovery solutions, such as Consul, can do it, and Stork can ask Consul for healthy services only (there is a configuration property for it).
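
For example, something along these lines should make Stork discover only instances whose Consul health checks pass (a sketch; the property name here is assumed and should be checked against the Stork Consul service discovery documentation):

    # sketch: only discover instances reported healthy by Consul
    stork.pong-replica.service-discovery=consul
    # property name assumed; verify against the Stork Consul service discovery options
    stork.pong-replica.service-discovery.use-health-checks=true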

Also, it is possible for a load balancer to add an instance to some kind of block list after a failure. Of the existing load balancers, the least-response-time one is the closest to doing this, but it just treats failures as a very long response time, so that fewer requests are directed to such an instance.

Does this answer your doubts?

pjgg closed this as completed Mar 17, 2022